2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.314
Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video

Abstract: Manual annotations of temporal bounds for object interactions (i.e. start and end times) are typical training input to recognition, localization and detection algorithms. For three publicly available egocentric datasets, we uncover inconsistencies in ground truth temporal bounds within and across annotators and datasets. We systematically assess the robustness of state-of-the-art approaches to changes in labeled temporal bounds, for object interaction recognition. As boundaries are trespassed, a drop of up to …
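The robustness assessment described in the abstract hinges on perturbing labelled start/end times and re-evaluating recognition. A minimal sketch of that idea, assuming a uniform random shift of the bounds; the shift distribution and the max_shift value are illustrative, not the paper's protocol:

```python
import random

# Hedged sketch of the kind of perturbation studied: shift an action's
# labelled start/end times, then re-evaluate the recogniser on the shifted
# segment. The uniform shift and max_shift are assumptions for illustration.

def perturb_bounds(t_start: float, t_end: float, max_shift: float = 1.0):
    """Randomly shift both bounds by up to max_shift seconds, keeping start < end."""
    new_start = t_start + random.uniform(-max_shift, max_shift)
    new_end = t_end + random.uniform(-max_shift, max_shift)
    if new_start >= new_end:  # keep a valid, non-empty segment
        new_start, new_end = min(new_start, new_end), max(new_start, new_end) + 1e-3
    return new_start, new_end

if __name__ == "__main__":
    print(perturb_bounds(12.4, 15.8))
```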

Cited by 28 publications (26 citation statements) · References 33 publications
“…For each narrated sentence, we adjust the start and end times of the action using AMT. To ensure the annotators are trained to perform temporal localisation, we use a clip from our previous work's understanding [33] that explains temporal bounds of actions. Each HIT is composed of a maximum of 10 consecutive narrated phrases p_i, where annotators label A_i = [t_si, t_ei] as the start and end times of the i-th action.…”
Section: Action Segment Annotations
confidence: 99%
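The excerpt above describes the annotation layout: narrated phrases p_i are grouped into HITs of at most 10 consecutive phrases, and each action i receives a temporal bound A_i = [t_si, t_ei]. A minimal sketch of that layout, assuming seconds as the time unit and hypothetical names throughout:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical representation of the annotation scheme quoted above; names
# and the seconds unit are assumptions for illustration.

@dataclass
class ActionAnnotation:
    phrase: str      # narrated phrase p_i
    t_start: float   # t_si (seconds, assumed)
    t_end: float     # t_ei (seconds, assumed)

def group_into_hits(phrases: List[str], max_per_hit: int = 10) -> List[List[str]]:
    """Split consecutive narrated phrases into HITs of at most max_per_hit phrases."""
    return [phrases[i:i + max_per_hit] for i in range(0, len(phrases), max_per_hit)]

if __name__ == "__main__":
    narrations = ["open fridge", "take milk", "close fridge", "pour milk"]
    hits = group_into_hits(narrations)
    # An annotator working one HIT returns one ActionAnnotation per phrase.
    print(hits, ActionAnnotation("open fridge", 12.4, 14.1))
```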
“…Fusion in Egocentric AR: Late fusion of appearance and motion has been frequently used in egocentric AR [8,24,38,40], as well as extended to additional streams aimed at capturing egocentric cues [21,37,38]. In [21], the spatial stream segments hands and detects objects.…”
Section: Related Work
confidence: 99%
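Late fusion, as referenced in the excerpt above, runs the appearance (RGB) and motion (optical-flow) streams independently and only combines their per-class scores at the end. A minimal sketch assuming softmax class scores and an illustrative equal weighting, not taken from any cited paper:

```python
import numpy as np

# Minimal sketch of late fusion of two streams: each stream produces its own
# per-class scores, which are combined only after prediction.

def late_fuse(rgb_scores: np.ndarray, flow_scores: np.ndarray, w_rgb: float = 0.5) -> np.ndarray:
    """Weighted average of per-class scores from the two streams."""
    return w_rgb * rgb_scores + (1.0 - w_rgb) * flow_scores

if __name__ == "__main__":
    rgb = np.array([0.1, 0.7, 0.2])   # stand-in for appearance-stream softmax scores
    flow = np.array([0.2, 0.3, 0.5])  # stand-in for motion-stream softmax scores
    fused = late_fuse(rgb, flow)
    print("predicted class:", int(np.argmax(fused)))
```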
“…Increasingly, datasets have been used for novel tasks, through pre-training (He et al 2019; Mettes et al 2016), self-supervision (Noroozi and Favaro 2016; Vondrick et al 2018) or additional annotations (Gupta and Malik 2016; Heilbron et al 2018). However, task adaptation demonstrates that models overfit to the data and annotations (Zhai et al 2019; Moltisanti et al 2017).…”
Section: Introduction and Related Datasets
confidence: 99%