2021
DOI: 10.48550/arxiv.2112.09583
Preprint
Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Abstract: Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a standard transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. In thi…

Cited by 5 publications (10 citation statements). References 32 publications (108 reference statements).
“…Recently, with the successful migration of Transformer [43] from natural language processing to computer vision (ViT [5], Swin-Transformer [24], etc.), the mainstream methods [6,9,16,17,19,22,26,35,44,45] on video-language retrieval tasks begin to utilize Transformer as encoders for both the video and the natural language. HERO [17] and ALPRO [16] explore boosting video-text alignment via large-scale pre-training tasks.…”
Section: Text-based Video Retrieval
confidence: 99%
“…), the mainstream methods [6,9,16,17,19,22,26,35,44,45] on video-language retrieval tasks begin to utilize Transformer as encoders for both the video and the natural language. HERO [17] and ALPRO [16] explore boosting video-text alignment via large-scale pre-training tasks. HiT [22] further proposed a hierarchical model with momentum contrast for video-text retrieval.…”
Section: Text-based Video Retrieval
confidence: 99%
“…Video-language pre-training (VidL) aims to learn generalizable multi-modal models from largescale video-text samples so as to better solve various challenging video-language understanding tasks, such as text-video retrieval [28,4,1,43] and video question answering [12,36,40]. Recent studies [16,17,7,9,44,47] have shown that VidL leads to significant performance improvement and achieves state-of-the-art results on various downstream text-video retrieval and video question answering (VQA) benchmarks.…”
Section: Introduction
confidence: 99%
“…Building a unified model capable of solving various video-language tasks is a long-standing challenge for machine learning research. A few recent works [7,17] attempt to learn a unified VidL model for both tasks, using the multi-modal encoder to conduct text-video retrieval. However, the model requires an exhaustive pair-wise comparison between the query texts and gallery videos.…”
Section: Introduction
confidence: 99%
“…However, a natural semantic gap between the two modalities, i.e., video and text, raises a great challenge, which hinders industrial-level applications of TVR. To this end, recent methods aim to distill cross-modal knowledge from large-scale pre-training experts [19,12,17,28] and leverage cross-modal contrastive learning to explore both intra-modal representation and cross-modal interaction.…”
Section: Introduction
confidence: 99%