2019
DOI: 10.48550/arxiv.1912.06430
Preprint

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Antoine Miech,
Jean-Baptiste Alayrac,
Lucas Smaira
et al.

Abstract: We describe an efficient approach to learn visual representations from highly misaligned and noisy narrations automatically extracted from instructional videos (Figure 1 of the paper shows example narrations such as "you are cutting the wood" and "readjusting the table saw" loosely paired with video frames). Our video representations are learnt from scratch without relying on any manually annotated visual dataset, yet outperform all self-supervised and many fully-supervised methods on several video recognition benchmarks.
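The core technique the abstract alludes to is the paper's MIL-NCE objective: because automatically extracted narrations are misaligned, each clip is matched against a small bag of temporally nearby narration candidates rather than a single caption. Below is a minimal PyTorch sketch of such a multiple-instance contrastive loss; the function and tensor names are illustrative, and the paper's exact formulation differs in details such as negative sampling.

```python
import torch

def mil_nce_loss(video_emb, text_emb):
    """Sketch of a MIL-NCE-style loss (illustrative, not the paper's code).

    video_emb: (B, D) clip embeddings from the video network.
    text_emb:  (B, K, D) embeddings of K candidate narrations per clip
               (e.g., the narrations closest in time); due to misalignment,
               any of the K candidates may describe the visual content.
    """
    B = video_emb.size(0)
    # Similarity between every clip and every narration candidate: (B, B, K).
    scores = torch.einsum('bd,nkd->bnk', video_emb, text_emb)
    exp_scores = scores.exp()
    # Numerator: sum over the bag of K positive candidates of clip i.
    pos = exp_scores[torch.arange(B), torch.arange(B)].sum(dim=-1)  # (B,)
    # Denominator: the positive bag plus all mismatched (negative) pairs.
    denom = exp_scores.sum(dim=(1, 2))                              # (B,)
    return -(pos / denom).log().mean()
```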

Cited by 13 publications (26 citation statements)
References 63 publications
“…Recent works focus on leveraging image-level annotations (as weak supervision) [8,9] or unsupervised methods [64] to learn the association between language descriptions and objects. Building on this, later works use uncurated captions to learn temporal associations between video segments and texts [27,51]. Notably, these works all highlight the importance of constructing contrastive pairs when exploiting weak annotations, and they inspire our work.…”
Section: Related Work (mentioning)
Confidence: 91%
“…Through extensive experiments on two benchmarks, we show that the WSRA model outperforms state-of-the-art weakly-supervised methods by a notable margin, achieving performance on par with, or even better than, some fully-supervised methods. Looking ahead, we believe our model would benefit most from video-language representation learning at scale [28], where training is often burdened by uncurated annotations, e.g., temporally misaligned descriptions [27]. To build a soundly and broadly pre-trained model, it is essential to properly leverage weak or biased annotations from multiple perspectives.…”
Section: Conclusion and Broader Impact (mentioning)
Confidence: 99%
“…In this work, we explore a different pretext task, which models the consistency between videos from the same action instance but with different visual tempos. There are also works that learn video representations using not only the videos themselves but also corresponding text [37,38,31] and audio [27,2,1,35]. In contrast to those works, we learn compact video representations from RGB frames only.…”
Section: Related Work (mentioning)
Confidence: 99%
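As a loose illustration of such a tempo-consistency pretext task (purely a sketch under assumed interfaces, not the cited work's actual architecture or loss), one can encode the same clip sampled at two temporal strides and pull the resulting embeddings together:

```python
import torch.nn.functional as F

def subsample_tempo(clips, stride):
    """Subsample batched clips (B, C, T, H, W) along time at a given stride,
    simulating a different visual tempo of the same action instances."""
    return clips[:, :, ::stride]

def tempo_consistency_loss(encoder, clips):
    """Encourage agreement between two tempo views of the same clips.
    `encoder` is any video network mapping (B, C, T, H, W) -> (B, D)."""
    z_slow = F.normalize(encoder(subsample_tempo(clips, stride=4)), dim=-1)
    z_fast = F.normalize(encoder(subsample_tempo(clips, stride=1)), dim=-1)
    # One minus cosine similarity between the two views of each clip.
    return (1.0 - (z_slow * z_fast).sum(dim=-1)).mean()
```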
“…$$\min_{f,g,h}\; \underbrace{L\big(f(\mathcal{X}),\, h(\mathcal{Z})\big)}_{\text{Language } \mathcal{X} \text{ and vision}} \;+\; \underbrace{L\big(g(\mathcal{Y}),\, h(\mathcal{Z})\big)}_{\text{Language } \mathcal{Y} \text{ and vision}} \qquad (1)$$
where L is a metric-learning loss between text and video embeddings [34]. The parameters f, g, and h define the embedding functions of the language X, language Y, and the video domain Z, respectively.…”
Section: Unsupervised Multilingual Learning (mentioning)
Confidence: 99%
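To make Eq. (1) concrete, here is a minimal sketch of the joint objective; the argument names are illustrative, and the metric-learning loss L is left as a parameter (an NCE-style loss in the cited setup [34]):

```python
def joint_embedding_loss(f, g, h, x, y, z, metric_loss):
    """Eq. (1): learn a shared embedding space for language X, language Y,
    and video Z.

    f, g, h: embedding networks for language X, language Y, and video.
    x, y, z: batches where (x, z) and (y, z) are co-occurring pairs.
    metric_loss: a metric-learning loss L between text and video embeddings.
    """
    loss_xz = metric_loss(f(x), h(z))  # language X <-> vision term
    loss_yz = metric_loss(g(y), h(z))  # language Y <-> vision term
    return loss_xz + loss_yz
```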
“…where N is a set of negative pairs, used to enforce that videos and narrations that co-occur in the data are close in the embedding space while those that do not are far apart. In this work, the negatives are x and z paired with other x and z chosen uniformly at random from the training set X, following [34]. In practice, each training batch includes clips from either language, and the negatives for each element in the NCE loss are the other elements of the batch in the same language.…”
Section: The Base Model: Training and Inference (mentioning)
Confidence: 99%
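As a rough sketch of this in-batch-negative scheme (assumed shapes and names; the exact loss in [34] additionally handles multiple candidate narrations), the NCE loss over one single-language batch can be written as:

```python
import torch
import torch.nn.functional as F

def nce_loss(text_emb, video_emb, temperature=0.07):
    """NCE-style loss over a batch of co-occurring (narration, clip) pairs.

    text_emb, video_emb: (B, D) embeddings of narrations x and clips z that
    co-occur in the data. For row i the positive is (x_i, z_i); the negative
    set N consists of x_i paired with the other clips in the batch (and vice
    versa), i.e., in-batch negatives drawn from the same language.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    sim = text_emb @ video_emb.t() / temperature            # (B, B) logits
    targets = torch.arange(sim.size(0), device=sim.device)  # diagonal positives
    # Symmetric: pull co-occurring pairs together, push all others apart.
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))
```

This `nce_loss` is one possible instantiation of the metric-learning loss L in Eq. (1) above.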