2018
DOI: 10.1007/978-3-030-01228-1_38

In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video

Abstract: We address the task of jointly determining what a person is doing and where they are looking based on the analysis of video captured by a headworn camera. To facilitate our research, we first introduce the EGTEA Gaze+ dataset. Our dataset comes with videos, gaze tracking data, hand masks and action annotations, thereby providing the most comprehensive benchmark for First Person Vision (FPV). Moving beyond the dataset, we propose a novel deep model for joint gaze estimation and action recognition in FPV. Our me…
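As a rough illustration of the joint-learning idea described in the abstract, the following minimal PyTorch sketch shares one video backbone between an action-classification head and a gaze-heatmap head. This is not the authors' architecture (which is truncated above); the tiny 3D-conv backbone, the 7x7 gaze grid, and the 106-class output are assumptions for illustration only.

```python
# Minimal sketch of a two-head network for joint action recognition and gaze
# prediction from egocentric video. NOT the authors' model; it only illustrates
# sharing a backbone between the two tasks. Backbone, feature sizes, and the
# 106-class output are assumptions.
import torch
import torch.nn as nn

class JointGazeActionNet(nn.Module):
    def __init__(self, num_actions: int = 106):
        super().__init__()
        # Shared spatio-temporal feature extractor (placeholder backbone).
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((1, 7, 7)),   # collapse time, keep a 7x7 grid
        )
        self.action_head = nn.Linear(64 * 7 * 7, num_actions)  # action logits
        self.gaze_head = nn.Conv2d(64, 1, kernel_size=1)        # gaze heatmap

    def forward(self, clip: torch.Tensor):
        # clip: (batch, 3, frames, height, width)
        feat = self.backbone(clip).squeeze(2)        # (batch, 64, 7, 7)
        action_logits = self.action_head(feat.flatten(1))
        gaze_map = self.gaze_head(feat)              # coarse 7x7 gaze heatmap
        return action_logits, gaze_map

# Usage: logits, gaze = JointGazeActionNet()(torch.randn(2, 3, 8, 64, 64))
```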

Cited by 244 publications (330 citation statements) · References 78 publications

Citation statements (ordered by relevance):
“…Although the type of information and ground truth annotations made available by the authors is heterogeneous, it is possible to identify some sub-areas that are more recurrent than others. The vast majority of datasets provided hand segmentation masks [2], [127], [128], [16], [22], [135], [46], [154], [47], [49], reflecting the high number of approaches proposed in this area (Section 3). However, the high number of datasets is counterbalanced by a relatively low number of annotated frames, usually on the order of a few hundred or a few thousand images.…”
Section: FPV Datasets With Hand Annotation (mentioning)
confidence: 99%
“…To expedite the lengthy pixel-level annotation process and build larger datasets for hand segmentation, some authors proposed semi-automated techniques, for example based on GrabCut [135], [43]. Actions/activities [150], [67], [135], [98], [154], [47] and hand gestures [127], [128], [107], [94], [109], [152], [126] are other common types of information captured and annotated in many datasets. This large amount of data has been used by researchers to develop robust HCI applications that rely on hand gestures.…”
Section: FPV Datasets With Hand Annotation (mentioning)
confidence: 99%
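The GrabCut-based semi-automated mask annotation mentioned in the statement above can be illustrated with a short OpenCV sketch. This is not the exact pipeline of the cited works; the frame path and the annotator-supplied bounding box are hypothetical placeholders.

```python
# Minimal sketch of GrabCut-based semi-automated hand-mask extraction with
# OpenCV, in the spirit of the semi-automated annotation mentioned above.
import cv2
import numpy as np

frame = cv2.imread("frame_000123.jpg")          # hypothetical egocentric frame
mask = np.zeros(frame.shape[:2], np.uint8)      # GrabCut working mask
bgd_model = np.zeros((1, 65), np.float64)       # background GMM parameters
fgd_model = np.zeros((1, 65), np.float64)       # foreground GMM parameters

# Rough rectangle around the hand supplied by an annotator (x, y, w, h).
rect = (200, 300, 180, 160)

# Run a few GrabCut iterations initialized from the rectangle.
cv2.grabCut(frame, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels labeled definite or probable foreground become the hand mask.
hand_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                     255, 0).astype(np.uint8)
cv2.imwrite("hand_mask_000123.png", hand_mask)
```

An annotator would then only need to correct the failure cases by hand, which is what makes the approach faster than fully manual pixel-level labeling.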
“…EGTEA Gaze+ is a recently collected dataset with approximately 10K samples of 106 activity classes. We use the first split, as in [14], which contains 8299 training and 2022 testing instances.…”
Section: Egocentric Video Datasets (mentioning)
confidence: 99%
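For readers working with the split quoted above, the sketch below shows how a train/test split list might be read and its size checked against the quoted counts. The file names and the one-sample-per-line format are assumptions, not the official EGTEA Gaze+ release format.

```python
# Minimal sketch of loading an EGTEA Gaze+-style train/test split from
# annotation list files (hypothetical names and format).
def load_split(path: str):
    """Read one sample per line, e.g. '<clip_id> <action_label>'."""
    with open(path) as f:
        return [line.split() for line in f if line.strip()]

train = load_split("train_split1.txt")   # hypothetical split file
test = load_split("test_split1.txt")     # hypothetical split file
print(len(train), len(test))             # roughly 8299 / 2022 for split 1
```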
“…A way to circumvent this is to leverage rich human activity datasets that contain task-labeled images and videos of humans manipulating objects [5]-[7]. The caveat is that these datasets are often 2D and they lack annotations that could facilitate the inference of a 6D grasp pose, which is…”
Section: Introduction (mentioning)
confidence: 99%