2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01432
BEVT: BERT Pretraining of Video Transformers

Cited by 139 publications (42 citation statements)
References 33 publications
“…Popular image-language models such as CLIP [83] and ALIGN [48] are trained on massive datasets by using web images and alt-text. Similarly, video-language models are catching up and can be categorised into two broad directions: (i) adapting image-language models for videos [8,22,49,50,62,65,71,108,110,119], and (ii) pure video-based models that are learned using large video-text datasets [3,7,26-28,30,57,61,64,67,68,95,117]. Recently, a new paradigm of post-pretraining has emerged where an existing image- or video-language model goes through another stage of self-supervised pretraining on a small amount of video data before it is evaluated on downstream tasks [65,119].…”
Section: Foundational Video-language Models
confidence: 99%
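As a rough illustration of the post-pretraining recipe quoted above, the Python sketch below continues training an (assumed already pretrained) image-language encoder pair on a small toy video-text set with a contrastive objective before any downstream evaluation. The encoder classes, the contrastive loss, and the random toy data are illustrative assumptions, not the setup used in the cited works [65,119].

```python
# Hedged sketch of "post-pretraining": continue self-supervised training of an
# already pretrained image-language model on a small amount of video data.
# Both encoders and the toy data below are placeholders, not a real checkpoint.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Stands in for a pretrained image encoder applied per frame."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, dim)   # toy patch projection

    def forward(self, frames):                    # frames: (B, T, 3*16*16)
        per_frame = self.proj(frames)             # (B, T, dim)
        return per_frame.mean(dim=1)              # temporal mean pooling

class TextEncoder(nn.Module):
    """Stands in for the pretrained text tower."""
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)

    def forward(self, tokens):                    # tokens: (B, L)
        return self.emb(tokens).mean(dim=1)       # (B, dim)

video_enc, text_enc = FrameEncoder(), TextEncoder()
opt = torch.optim.AdamW(
    list(video_enc.parameters()) + list(text_enc.parameters()), lr=1e-4)

# Toy "small video-text dataset": 8 clips of 4 frames and 8 captions.
videos = torch.randn(8, 4, 3 * 16 * 16)
captions = torch.randint(0, 1000, (8, 12))

for step in range(10):                            # short post-pretraining stage
    v = F.normalize(video_enc(videos), dim=-1)
    t = F.normalize(text_enc(captions), dim=-1)
    logits = v @ t.T / 0.07                       # video-text similarity matrix
    labels = torch.arange(len(v))
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
    opt.zero_grad()
    loss.backward()
    opt.step()
```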
“…and Bao et al. (2022) show two different masked image modeling paradigms, and both achieve state-of-the-art results. Because of the similarity between images and videos, these two paradigms are also suitable for pre-training video transformers (Tong et al., 2022; Wang et al., 2021d). In Tong's work (Tong et al., 2022), the video volume is masked by random tubes, and the training objective is to regress the RGB pixels located inside the tubes.…”
Section: Related Work
confidence: 99%
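A minimal sketch of the tube-masking objective described above: the same spatial patches are masked in every frame (forming space-time tubes), and the loss regresses raw RGB pixels only at the masked positions. The patch size, mask ratio, and the tiny MLP are assumptions for illustration, not the configuration of Tong et al. (2022).

```python
# Hedged sketch of tube masking with an RGB-regression target.
# Shapes, mask ratio, and the toy model are illustrative assumptions.
import torch
import torch.nn as nn

B, T, C, H, W, P = 2, 8, 3, 32, 32, 8            # batch, frames, channels, size, patch
num_patches = (H // P) * (W // P)                # spatial patches per frame
mask_ratio = 0.75

video = torch.randn(B, T, C, H, W)

# Patchify each frame: (B, T, num_patches, C*P*P).
patches = video.unfold(3, P, P).unfold(4, P, P)              # (B,T,C,H/P,W/P,P,P)
patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, T, num_patches, -1)

# Tube mask: the SAME spatial positions are masked in every frame.
num_masked = int(mask_ratio * num_patches)
mask = torch.zeros(B, num_patches, dtype=torch.bool)
for b in range(B):
    mask[b, torch.randperm(num_patches)[:num_masked]] = True
tube_mask = mask.unsqueeze(1).expand(B, T, num_patches)      # broadcast over time

# Toy encoder/decoder predicting the raw pixels of each patch.
dim = C * P * P
model = nn.Sequential(nn.Linear(dim, 256), nn.GELU(), nn.Linear(256, dim))
pred = model(patches)

# Pixel-regression loss computed only on the masked tube positions.
loss = ((pred - patches) ** 2)[tube_mask].mean()
loss.backward()
```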
“…In Tong's work (Tong et al., 2022), the video volume is masked by random tubes, and the training objective is to regress the RGB pixels located inside the tubes. In contrast, Wang et al. (2021d) use a pre-trained VQ-VAE tokenizer (Ramesh et al., 2021) for both the video and image modalities to generate discrete visual tokens, which aims to free the model from fitting short-range dependencies and high-frequency details. Although the VQ-VAE tokenizer provides a semantic-level signal for each spatio-temporal patch separately, pre-training such a tokenizer requires a prohibitive amount of data and computation (Ramesh et al., 2021), which is inefficient.…”
Section: Related Work
confidence: 99%
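A minimal sketch of the discrete-token alternative described above: a frozen tokenizer assigns each patch a codebook index, and the model is trained to classify the indices of the masked patches. The random codebook below stands in for a real pre-trained VQ-VAE tokenizer (whose API is not reproduced here), and all sizes and the mask ratio are illustrative assumptions.

```python
# Hedged sketch of masked prediction of discrete visual tokens.
# The frozen random codebook is a placeholder for a pre-trained VQ-VAE tokenizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, patch_dim, vocab = 2, 64, 192, 512      # batch, patches per clip, patch dim, codebook size
patches = torch.randn(B, N, patch_dim)

# Stand-in "tokenizer": nearest codebook entry per patch (kept frozen).
codebook = torch.randn(vocab, patch_dim)
with torch.no_grad():
    dists = (patches.unsqueeze(-2) - codebook).pow(2).sum(-1)   # (B, N, vocab)
    target_tokens = dists.argmin(dim=-1)                        # discrete visual tokens

# Mask 40% of the patches and replace them with a learnable [MASK] vector.
mask = torch.rand(B, N) < 0.4
mask_embed = nn.Parameter(torch.zeros(patch_dim))
inputs = torch.where(mask.unsqueeze(-1), mask_embed.expand_as(patches), patches)

# Toy backbone with a classification head over the codebook vocabulary.
backbone = nn.Sequential(nn.Linear(patch_dim, 256), nn.GELU(), nn.Linear(256, vocab))
logits = backbone(inputs)                     # (B, N, vocab)

# Cross-entropy only on masked positions (masked token prediction).
loss = F.cross_entropy(logits[mask], target_tokens[mask])
loss.backward()
```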
“…Although they adopt GloVe [26] embeddings for the query, the feature gap issue is well alleviated. Considering recent advances in video-based vision-language pre-training (e.g., BEVT [218], ActBERT [23], ClipBERT [219], and VideoCLIP [220]), dedicated and more effective feature extractors for TSGV are highly anticipated.…”
Section: Chapter 8 Conclusion and Future Work, 8.1 Conclusion
confidence: 99%