Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021)
DOI: 10.18653/v1/2021.emnlp-main.544
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Abstract: We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance…

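The two ingredients the abstract highlights, a contrastive objective over temporally overlapping video-text pairs and hard negatives mined by nearest-neighbor retrieval, can be illustrated with a short PyTorch sketch. This is not the authors' released code: the function name, the temperature value, and the in-batch nearest-neighbor simplification (the paper retrieves hard negatives over the training corpus rather than within a batch) are assumptions made here for illustration.

```python
# Minimal sketch (assumptions noted above) of a CLIP-style video-text
# contrastive loss with retrieved hard negatives.
import torch
import torch.nn.functional as F

def videoclip_style_loss(video_emb, text_emb, temperature=0.07, k_hard=2):
    """video_emb, text_emb: (B, D) pooled transformer outputs for B
    temporally overlapping video-text pairs (row i of each is a positive pair)."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity of every text to every video; diagonal entries are positives,
    # off-diagonal entries act as in-batch negatives.
    logits = text_emb @ video_emb.t() / temperature             # (B, B)
    targets = torch.arange(video_emb.size(0), device=video_emb.device)

    # Hard negatives via nearest-neighbor retrieval (here simplified to
    # within-batch retrieval): for each video, find the k most similar
    # *other* videos.
    with torch.no_grad():
        vv_sim = video_emb @ video_emb.t()
        vv_sim.fill_diagonal_(float("-inf"))
        hard_idx = vv_sim.topk(k_hard, dim=-1).indices           # (B, k_hard)

    # Append the corresponding text-to-hard-video similarities again so they
    # count twice in the softmax denominator (a crude up-weighting of hard negatives).
    hard_logits = torch.gather(logits, 1, hard_idx)              # (B, k_hard)
    logits_with_hard = torch.cat([logits, hard_logits], dim=1)

    # Symmetric text->video and video->text losses, as is common for CLIP-style models.
    loss_t2v = F.cross_entropy(logits_with_hard, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2v + loss_v2t)
```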
Cited by 229 publications (132 citation statements) · References 49 publications

Citation statements (ordered by relevance):
“…Although they adopt GloVe [36] embeddings for query, the issues of feature gap are well alleviated. Considering recent advances in video-based vision-language pretraining (e.g., BVET [168], ActBERT [169], ClipBERT [170], and VideoCLIP [171]), dedicated or more effective feature extractors for TSGV are much expected.…”
Section: Effective Feature Extractor(s) (citation type: mentioning)
Confidence: 99%
“…Video-and-language Pre-training. Apart from the canonical pre-training tasks, such as masked language modeling (MLM) [10,26,30,35,48,57] and video-text matching (VTM) [30,35], several methods [35,37,52] apply contrastive learning on offline extracted visual features. Without adapting the visual backbone, their ability to align cross-modal features remain limited.…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
“…Existing sparse video-language pre-training models use either dot-product [3,37,39,52] or rely entirely on a transformer encoder [26,30,48,57] to model cross-modal interactions. However, since video and text features reside in different embedding spaces, such methods lead to less satisfactory alignment.…”
Section: Contrastive Video-text Alignment (citation type: mentioning)
Confidence: 99%
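The distinction this passage draws, dot-product alignment between independently encoded modalities versus a joint transformer encoder over concatenated tokens, can be sketched roughly as follows. Module names and hyperparameters are illustrative and not taken from any cited paper.

```python
# Hedged sketch of the two families of cross-modal interaction mentioned above.
import torch
import torch.nn as nn

class DualEncoderScore(nn.Module):
    """Score = dot product between independently encoded video and text embeddings."""
    def forward(self, video_emb, text_emb):             # (B, D), (B, D)
        return (video_emb * text_emb).sum(dim=-1)       # (B,) pairwise scores

class JointEncoderScore(nn.Module):
    """Score produced by a transformer over the concatenated token sequences,
    so video and text tokens can attend to each other."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, 1)

    def forward(self, video_tokens, text_tokens):       # (B, Tv, D), (B, Tt, D)
        fused = self.encoder(torch.cat([video_tokens, text_tokens], dim=1))
        return self.head(fused.mean(dim=1)).squeeze(-1)  # (B,) match scores
```

The practical trade-off: a dual encoder lets video and text embeddings be precomputed and indexed for retrieval, while a joint encoder models finer-grained cross-modal interactions at a higher inference cost.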