2021
DOI: 10.48550/arxiv.2105.09996
Preprint

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Cited by 12 publications (19 citation statements)
References 25 publications

“…35.4 [145] (6K) 138.7 [142] (10K) 36.7 [138] (46K) 75.2 [87] (123K) 54.7 [147] (20K) 25.2 [139] (38K) 75.4 [60] (9K) -…” (fragment of a results table)
Section: Training Details for the Flamingo Models
Mentioning; confidence: 99%
“…Our pre-trained model achieves higher performance at lower computation cost. Finally, some work [27,28,30,50,55] adopts a joint encoder that takes concatenated videos and texts as input, so every text-video pair must be fed through the encoder during inference, resulting in low retrieval efficiency. By comparison, our model adopts the efficient "dual-encoder" architecture, with only a video encoder and a text encoder at inference.…”
Section: Methods
Mentioning; confidence: 99%
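
The contrast this excerpt draws is architectural: with a dual encoder, each modality is embedded independently, so candidate video embeddings can be precomputed offline and retrieval reduces to a similarity lookup. A minimal PyTorch-style sketch of that pattern follows; the projection heads, feature dimensions, and names are illustrative assumptions, not the cited model's implementation.

```python
# A minimal sketch of a dual-encoder retrieval setup, assuming simple
# linear projection heads over precomputed features; all layer names
# and dimensions are illustrative, not the cited model's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, joint_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)  # video branch
        self.text_proj = nn.Linear(text_dim, joint_dim)    # text branch

    def forward(self, video_feats, text_feats):
        # Each modality is encoded independently, so video embeddings
        # can be computed once offline; retrieval is then one matrix
        # multiply rather than one encoder pass per text-video pair.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return t @ v.T  # [num_texts, num_videos] cosine similarities

# Usage: scores[i, j] ranks video j for text query i.
scores = DualEncoder()(torch.randn(100, 2048), torch.randn(8, 768))
```

Because nothing couples the two branches before the final similarity, scoring N texts against M videos costs N + M encoder passes rather than N × M.
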
“…Although these methods are efficient for video-text retrieval, they ignore local semantics and fine-grained alignment between modalities. Methods in the second category [27,28,30,47,50,55] adopt "joint-encoder" architectures that model interactions between cross-modal local features by concatenating videos and texts as input, with a binary classifier predicting whether a video and a text are aligned. Although they can build local associations between videos and texts, they sacrifice retrieval efficiency, since every text-video pair must be fed through the encoder during inference.…”
Section: Related Work
Mentioning; confidence: 99%
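
For contrast with the dual encoder above, here is a minimal sketch of the joint-encoder pattern this excerpt describes: video and text tokens are concatenated, a transformer attends across both modalities, and a binary head scores alignment. All layer sizes and names are illustrative assumptions.

```python
# A minimal sketch of the joint-encoder pattern: video and text tokens
# are concatenated and a binary head scores alignment. All sizes and
# names here are illustrative assumptions, not a specific cited model.
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.align_head = nn.Linear(dim, 1)  # aligned / not-aligned logit

    def forward(self, video_tokens, text_tokens):
        # Self-attention over the concatenated sequence lets every text
        # token attend to every video token (fine-grained alignment),
        # but each candidate pair must be re-encoded at inference time.
        x = torch.cat([video_tokens, text_tokens], dim=1)
        x = self.encoder(x)
        return self.align_head(x.mean(dim=1))  # [batch, 1] alignment score

# Usage: one forward pass per (video, text) candidate pair.
logit = JointEncoder()(torch.randn(1, 32, 512), torch.randn(1, 12, 512))
```

The cross-modal attention is what buys the fine-grained alignment, and it is also why every candidate pair needs its own forward pass at inference, the efficiency cost both excerpts note.
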
“…An instructional or how-to video contains a human subject demonstrating and narrating how to accomplish a certain task. Early work on HowTo100M has focused on leveraging this large collection to learn models that can be transferred to other tasks, such as action recognition [4,37,38], video captioning [24,36,66], or text-video retrieval [7,37,61]. The problem of recognizing the task performed in an instructional video has been considered by Bertasius et al. [8].…”
Section: Related Work
Mentioning; confidence: 99%