2022
DOI: 10.48550/arxiv.2203.07720
Preprint

Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

Abstract: Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language tasks. Despite the impressive results, VLP research becomes extremely expensive with the need for massive data and a long training time, preventing further explorations. In this work, we revitalize region features of sparsely sampled video clips to significantly reduce both spatial and temporal visual redundancy…
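The abstract describes the core idea of reducing redundancy by keeping only a few sampled clips and their pre-extracted region features. As a rough illustration only, the sketch below mimics that idea with made-up shapes, a uniform clip-sampling heuristic, and a simple top-K region cut; the function name and dimensions are assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's implementation) of sparse clip sampling with
# pre-extracted region features: few clips (temporal sparsity), few regions per
# clip (spatial sparsity). Shapes and the sampling heuristic are assumptions.
import numpy as np

def sample_clip_regions(region_feats, num_clips=4, regions_per_clip=10):
    """region_feats: array of shape (T, R, D) with pre-extracted region
    features for T clips, R detected regions each, D-dim features.
    Returns a (num_clips, regions_per_clip, D) array of kept features."""
    T, R, D = region_feats.shape
    # Temporal sparsity: uniformly spaced clip indices instead of all T clips.
    clip_idx = np.linspace(0, T - 1, num=num_clips).round().astype(int)
    # Spatial sparsity: keep only the first regions_per_clip regions per clip
    # (a detector would typically rank regions by confidence).
    return region_feats[clip_idx, :regions_per_clip, :]

# Usage: 32 clips, 36 regions each, 2048-d detector features.
feats = np.random.randn(32, 36, 2048).astype(np.float32)
print(sample_clip_regions(feats).shape)  # (4, 10, 2048)
```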

Cited by 2 publications (2 citation statements)
References 43 publications (131 reference statements)

“…Different metaverse users may have different tasks that require AI assistance, making it challenging to customize AI models and training data for each use case. A promising solution is to develop foundation deep learning models [96], [176]-[178] that are first trained on large-scale data and can then be applied to various downstream tasks with minimal task-specific fine-tuning. As a starting point, the recent work EgoVLP [179] conducted visual-language pretraining on Ego4D [180], a massive egocentric video dataset aimed at collecting what smart glasses see throughout human daily activities; the pretrained EgoVLP model then demonstrates strong performance on various metaverse applications such as action recognition, moment query via a textual description, and object state change detection.…”
Section: J. AR/VR Data Streaming and Learning (citation type: mentioning)
confidence: 99%
“…VideoBERT (Sun et al. 2019b), ActBert (Zhu and Yang 2020), DECEMBERT (Tang, Lei, and Bansal 2021), and VIOLET (Fu et al. 2021) pre-train matching tasks using the special token [CLS] for binary classification (Ruan and Jin 2022) with a cross-modal encoder (Vaswani et al. 2017). Some methods (Zellers et al. 2021; Ge et al. 2022; Miech et al. 2019, 2020; Ging et al. 2020; Wang et al. 2022b; Yang, Bisk, and Gao 2021; Yan et al. 2021; Luo et al. 2021; Patrick et al. 2020; Cai et al. 2022; Li et al. 2020; Xu et al. 2021b; Cao et al. 2022) pre-train matching tasks with two-stream encoders by forcing paired samples closer while pushing mismatched ones apart (Ruan and Jin 2022). Others (Luo et al. 2020; Li et al. 2022) combine cross-modal Transformer matching tasks with two-stream encoder matching tasks for stronger learning ability.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
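The excerpt above mentions two-stream matching objectives that pull paired video-text embeddings together and push mismatched ones apart. The snippet below is a minimal, hedged sketch of such a symmetric contrastive (InfoNCE-style) loss; the encoder outputs, batch size, and temperature value are illustrative assumptions, not any cited paper's exact formulation.

```python
# Minimal sketch of a two-stream contrastive matching objective: paired
# video/text embeddings are pulled together, all other pairs in the batch are
# pushed apart. Names, shapes, and the temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_matching_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) outputs of separate (two-stream) encoders."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(v.size(0))       # matched pairs lie on the diagonal
    # Symmetric InfoNCE: video-to-text and text-to-video cross-entropy.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return (loss_v2t + loss_t2v) / 2

# Usage with random embeddings for a batch of 8 clip-caption pairs.
video = torch.randn(8, 256)
text = torch.randn(8, 256)
print(contrastive_matching_loss(video, text).item())
```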