2021
DOI: 10.48550/arxiv.2111.01024
Preprint

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

Abstract: In egocentric videos, actions occur in quick succession. We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action sequence context to enhance the predictions. We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting stat…
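The following is a minimal, hypothetical PyTorch sketch of the idea described in the abstract, not the authors' released implementation: per-action video and audio features for a window of surrounding actions are projected into a shared space, summed, and passed through a transformer encoder so that each action's prediction can attend to its temporal neighbours. All class and parameter names, dimensions, the window size, the sum-based fusion, and the number of classes are illustrative assumptions; the explicit language model over action sequences mentioned in the abstract is omitted here.

    # Hypothetical sketch of temporal-context attention over surrounding actions.
    # Not the authors' code; names, dimensions, and fusion choice are assumptions.
    import torch
    import torch.nn as nn

    class TemporalContextRecogniser(nn.Module):
        def __init__(self, feat_dim=1024, d_model=512, num_classes=100,
                     num_layers=4, num_heads=8, window=9):
            super().__init__()
            # Project per-action video and audio features into a shared space.
            self.video_proj = nn.Linear(feat_dim, d_model)
            self.audio_proj = nn.Linear(feat_dim, d_model)
            # Learned positional embedding over the window of surrounding actions.
            self.pos_emb = nn.Parameter(torch.zeros(window, d_model))
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=num_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
            self.classifier = nn.Linear(d_model, num_classes)

        def forward(self, video_feats, audio_feats):
            # video_feats, audio_feats: (batch, window, feat_dim), one feature
            # vector per action in the temporal window around the target action.
            x = self.video_proj(video_feats) + self.audio_proj(audio_feats)
            x = x + self.pos_emb          # broadcast over the batch dimension
            x = self.encoder(x)           # each action attends to its neighbours
            return self.classifier(x)     # (batch, window, num_classes)

    if __name__ == "__main__":
        model = TemporalContextRecogniser()
        v = torch.randn(2, 9, 1024)       # dummy per-action video features
        a = torch.randn(2, 9, 1024)       # dummy per-action audio features
        logits = model(v, a)              # predictions for every action in the window
        print(logits.shape)               # torch.Size([2, 9, 100])

Predicting every action in the window (rather than only the central one) mirrors the intuition that context labels and the target label constrain each other; in practice a loss would typically be applied to all positions.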

Cited by 4 publications (8 citation statements). References: 49 publications.
“…In egocentric videos, temporal context might be even more informative. For instance, being unedited and continuous, they show actions unfolding in a more often than not predictable sequence [20,21,24]. Beyond visual cues, we argue that context from the audio stream also provides priors to better localize actions.…”
Section: Introduction
confidence: 89%
“…Deep learning facilitates audiovisual learning as it enables learning per-modality hierarchical representations [37], which are preferable to designing hand-crafted features. Recent works provide more sophisticated solutions in which the learned modality representations are fused implicitly by the network and optimized for the downstream task, such as [1,17,24,25,32,42,44]. While several works discussed the audiovisual scenario for the action recognition task [44], incorporating audio for TAL is not a widely researched area.…”
Section: Related Work
confidence: 99%
“…The complex nature of egocentric videos raises a variety of challenges, such as ego-motion [18], partially visible or occluded objects, and environmental bias [19], [20], [21], [22], which limit the performance of traditional approaches when used in FPAR [23], [24]. These challenges have attracted the community's interest and motivated the design of novel and more complex architectures, often based on multi-stream approaches such as [25], [15], [26], [27], [16].…”
Section: Related Work, First Person Action Recognition (FPAR)
confidence: 99%
“…The complex nature of egocentric videos raises a variety of challenges, such as egomotion [60], partially visible or occluded objects, and environmental bias [53,72,77,84,88], which limit the performance of traditional, third-person approaches when used in first person action recognition (FPAR) [14,15]. The community's interest has quickly grown [16,17,19,83] in recent years, thanks to the possibilities that these data open for the evaluation and understanding of human behavior, leading to the design of novel architectures [30,51,52,91,104]. While the use of optical flow has been the de-facto procedure [14][15][16][17]19,41] in FPAR, the interest has recently shifted towards more lightweight alternatives, such as gaze [27,59,71], audio [9,52,77], depth [32], skeleton [32], and inertial measurements [41], to enable motion modeling in online settings.…”
Section: Related Work
confidence: 99%