2022
DOI: 10.48550/arxiv.2207.07885
Preprint

Clover: Towards A Unified Video-Language Alignment and Fusion Model

Abstract: Building a universal video-language model for solving various video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge to the machine learning field. Towards this goal, most recent attempts train the models, usually consisting of uni-modal and cross-modal feature encoders, with supervised or pair-wise contrastive pre-text tasks. Though offering attractive generality, the resulting models have to compromise between efficiency and performance. We argue the flaws are ca…

Cited by 4 publications (4 citation statements)
References 37 publications
“…Yiwei et al [24] presented the X-CLIP model, a multi-grain contrastive model for video-text retrieval. Jingjia et al [25] proposed the Clover approach to pretraining linked video language, which sets a new state-of-the-art on several downstream tasks. Haofan et al [26] suggested a video-text retrieval strategy for cross-modal representation learning, which aims to select the video from a pool of candidate videos that matches the text query.…”
Section: Text Regain Methods (mentioning; confidence: 99%)
“…However, it is hard to get a pretrained model as powerful as CLIP in the video domain due to the unaffordable demands on computation resources and the difficulty of collecting video-text data pairs as large and diverse as image-text data. Instead of directly pursuing video-text pretrained models [17,27], a potential alternative solution that benefits video downstream tasks is to transfer the knowledge in image-text pretrained models to the video domain, which has attracted increasing attention in recent years [12,13,26,29,30,41].…”
Section: Introduction (mentioning; confidence: 99%)
“…Text-based video retrieval tasks, including text-based video retrieval (Luo et al 2020;Zhu and Yang 2020;Li et al 2020) and text-based video corpus moment retrieval (Li et al 2020) have shown significant potential and alluring technological value. Thanks to the ability of cross-modality alignments, video-language pre-training shows effectiveness on these retrieval tasks (Wang et al 2022b;Huang et al 2022). Cross-modal alignment is the key challenge in learning video-language pre-training.…”
Section: Introduction (mentioning; confidence: 99%)