“…Popular image-language models such as CLIP [83] and ALIGN [48] are trained on massive datasets of web images and alt-text. Similarly, video-language models are catching up and can be categorised into two broad directions: (i) adapting image-language models for videos [8,22,49,50,62,65,71,108,110,119], and (ii) pure video-based models that are learned using large video-text datasets [3,7,26-28,30,57,61,64,67,68,95,117]. Recently, a new paradigm of post-pretraining has emerged in which an existing image- or video-language model goes through another stage of self-supervised pretraining on a small amount of video data before it is evaluated on downstream tasks [65,119].…”