EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

Tan, Shuhan; Nagarajan, Tushar; Grauman, Kristen

doi:10.48550/arxiv.2301.02217

Cited by 2 publications

(3 citation statements)

References 55 publications


“…There is a wide range of research on egocentric videos, covering topics such as human-object interactions [23], activity recognition [24][25][26], anticipation [27], video summarization [28,29], hand detection [30], parsing social interactions [31], and inferring the camera wearer's body pose [32]. Most of these works aim to evaluate behaviors over extended temporal durations.…”

Section: Egocentric Video Research (mentioning)

confidence: 99%


Salient object detection in egocentric videos

Zhang, Liang, Zhao et al. 2024, IET Image Processing


In the realm of video salient object detection (VSOD), the majority of research has traditionally been centered on third‐person perspective videos. However, this focus overlooks the unique requirements of certain first‐person tasks, such as autonomous driving or robot vision. To bridge this gap, a novel dataset and a camera‐based VSOD model, CaMSD, specifically designed for egocentric videos, are introduced. First, the SalEgo dataset, comprising 17,400 fully annotated frames for video salient object detection, is presented. Second, a computational model that incorporates a camera movement module is proposed, designed to emulate the patterns observed when humans view videos. Additionally, to segment a single salient object precisely during switches between salient objects, rather than segmenting two objects simultaneously, a saliency enhancement module based on the Squeeze and Excitation Block is incorporated. Experimental results show that the approach outperforms other state‐of‐the‐art methods in egocentric video salient object detection tasks. Dataset and codes can be found at https://github.com/hzhang1999/SalEgo.



“…For video images with unknown camera parameters, we crop with default parameters. Some studies [24] focusing on egocentric videos have already shown the effectiveness of camera movement for action recognition. And in Figure 4, we show the different types of correlation between camera movement and salient movement.…”

Section: Camera Movement Module (mentioning)

confidence: 99%

Salient object detection in egocentric videos

Zhang, Liang, Zhao et al. 2024, IET Image Processing


“…Multimodal (egocentric) video understanding. In the context of (egocentric) video understanding, several works have shown that using additional modalities at inference time significantly improves performance [25,29,33,36,43,50,56,61]. The hypothesis is intuitive: certain actions are more easily understood from specific modalities, e.g.…”

Section: Related Work (mentioning)

confidence: 99%

Multimodal Distillation for Egocentric Action Recognition

Radevski, Grujicic, Blaschko et al. 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV)


The focal point of egocentric video understanding is modelling hand-object interactions. Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well; however, their performance improves further by employing additional input modalities (e.g. object detections, optical flow, audio), which provide cues complementary to the RGB modality. The added complexity of the modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such a multimodal approach while using only the RGB frames as input at inference time. We demonstrate that for egocentric action recognition on the Epic-Kitchens and the Something-Something datasets, students that are taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground truth labels in a unimodal or multimodal fashion. We further adopt a principled multimodal knowledge distillation framework, allowing us to deal with issues that occur when applying multimodal knowledge distillation in a naïve manner. Lastly, we demonstrate the achieved reduction in computational complexity, and show that our approach maintains higher performance as the number of input views is reduced. We release our code at: https://github.com/gorjanradevski/multimodal-distillation
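The abstract above describes an RGB-only student trained to match multimodal teachers. The following is a minimal sketch of the generic (naïve) form of that idea, not the authors' framework: per-modality teacher predictions are averaged and the student is trained with a temperature-scaled KL term plus cross-entropy. All function names, the averaging rule, and the `temperature` and `alpha` values are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def ensemble_teacher_probs(per_modality_logits, temperature):
    """Average the softened predictions of the modality-specific teachers
    (assumed combination rule; other weightings are possible)."""
    probs = [softmax(logits, temperature) for logits in per_modality_logits]
    return [sum(column) / len(probs) for column in zip(*probs)]


def distillation_loss(student_logits, per_modality_teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    teacher = ensemble_teacher_probs(per_modality_teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kd_term = kl_divergence(teacher, student_soft) * temperature ** 2
    ce_term = -math.log(softmax(student_logits)[label] + 1e-12)
    return alpha * kd_term + (1.0 - alpha) * ce_term


# Hypothetical logits for a 3-class problem: e.g. an RGB teacher and a flow teacher.
teachers = [[2.0, 0.5, -1.0], [1.5, 0.2, -0.5]]
agreeing = distillation_loss([2.0, 0.5, -1.0], teachers, label=0)
disagreeing = distillation_loss([-1.0, 0.5, 2.0], teachers, label=0)
print(agreeing < disagreeing)
```

A student whose logits agree with the teachers and the label receives a lower loss than one that disagrees; at inference time only the student, with RGB input, would be evaluated.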

