2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00272

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 milli…

Cited by 829 publications (916 citation statements)
References 56 publications
“…1) Setup for MSR-VTT: We follow the official data split, which divides MSR-VTT into three disjoint subsets used for training, validation and test, respectively. Note that in [34] and its follow-ups [16]-[18], a smaller test set of 1,000 videos randomly sampled from the full test set is used, which we refer to as test-1k.…”
Section: A Experimental Setup (mentioning)
confidence: 99%
“…• Miech et al. [18]: Uses a 1D-CNN as its sentence encoder. • Dual Encoding [14]: Hierarchical encoding that combines BoW, bi-GRU and 1D-CNN.…”
Section: Experiments 3 Combined Loss Versus Single Loss (mentioning)
confidence: 99%
“…(2) Scale: Compared with the recent datasets for image classification (e.g., ImageNet [18] with 1 million images) and action detection (e.g., ActivityNet v1.3 [30] with 20k videos), most existing instructional video datasets are relatively smaller in scale. Though the HowTo100M dataset provided a great amount of data, its automatically generated annotation might be inaccurate as the authors mentioned in [46]. The challenge of building such a large-scale dataset mainly stems from the difficulty to organize enormous amount of video and the heavy workload of annotation.…”
Section: Datasets Related To Instructional Video Analysis (mentioning)
confidence: 99%
“…It learns a single output embedding which is the weighted similarity between the different implicit visual-text embeddings. Recently, Miech et al. [23] propose the HowTo100M dataset: A large dataset collected automatically using generated captions from YouTube of 'how to' tasks. They find that finetuning on these weakly-paired video clips allows for state-of-the-art performance on a number of different datasets.…”
Section: Related Work (mentioning)
confidence: 99%