2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00490

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Cited by 338 publications (544 citation statements)
References 29 publications

“…Vision-language pretraining (VLP) aims to learn joint visual-textual representations for a variety of multimodal downstream tasks. Existing works either learn unimodal encoders by distinguishing the positive pair(s) from the unpaired samples [3, 28, 41] or focus on one multimodal encoder for joint feature learning with masked image/language modeling and image-text matching losses [12, 27, 29]. Additionally, some approaches seek fine-grained supervision for cross-modal interaction [20, 30, 54–56].…”
Section: Related Work (mentioning)
confidence: 99%
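The "distinguish positive pairs from unpaired samples" objective this snippet describes is usually a symmetric image-text contrastive loss. The sketch below is a minimal, generic InfoNCE instantiation, assuming already-projected batch embeddings; it is not claimed to be the exact loss of any of the cited works.

```python
# Hedged sketch: symmetric image-text contrastive (InfoNCE) loss, i.e. the
# "positive pair vs. unpaired samples" idea mentioned above. Projection heads
# and temperature values vary across the cited works.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings of paired image-text samples."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal entries are positives
    loss_i2t = F.cross_entropy(logits, targets)                   # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)               # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```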
“…LAVIS currently supports 4 foundation models, i.e. ALBEF [34], BLIP [33], CLIP [44] and ALPRO [32].…”
Section: Supported Tasks, Datasets and Models (mentioning)
confidence: 99%
“…(iii) State-of-the-art and reproducible language-vision models. The library enables access to over 30 pre-trained and task-specific fine-tuned model checkpoints of four foundation models: ALBEF [34], BLIP [33], CLIP [44] and ALPRO [32]. These models achieve competitive performance across multiple tasks evaluated using common metrics.…”
Section: Introduction (mentioning)
confidence: 99%
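Both LAVIS snippets refer to loading pre-trained checkpoints of these foundation models, including ALPRO. A brief usage sketch follows; `load_model_and_preprocess` is LAVIS's documented loader, but the specific `name`/`model_type` strings below are assumptions about the model zoo and should be checked against the library version in use.

```python
# Hedged sketch: loading an ALPRO checkpoint through LAVIS. The name and
# model_type strings are assumed model-zoo keys and may differ across versions.
import torch
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="alpro_retrieval",   # assumed key for the ALPRO video-text retrieval model
    model_type="msrvtt",      # assumed checkpoint fine-tuned on MSRVTT
    is_eval=True,
    device=device,
)
```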
“…To alleviate the temporal misalignment issue, it incorporates an entropy minimization-based constrained attention loss to encourage the model to automatically focus on the correct captions from a pool of candidate ASR captions. Li et al. (2022d) propose a new visually-grounded pre-training task, prompting entity modeling (PEM), to learn fine-grained region-entity alignment. The prediction targets for the PEM task are generated by an entity prompter module, trained with contrastive learning to produce the similarity between a video crop and text prompts instantiated with entity names.…”
Section: Advanced Pre-training Tasks (mentioning)
confidence: 99%
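To make the entity-prompter idea behind PEM more concrete, here is a minimal sketch of scoring a video crop against prompts instantiated with entity names. The prompt template, entity list, and the use of random tensors in place of the video and text encoders are illustrative assumptions, not ALPRO's exact implementation.

```python
# Hedged sketch of the PEM entity prompter: embed a video crop and a set of
# text prompts ("A video of a {entity}"), then turn cosine similarities into
# soft pseudo-labels. Encoders, template, and entity vocabulary are assumed.
import torch
import torch.nn.functional as F

ENTITIES = ["dog", "guitar", "car", "person"]   # assumed entity vocabulary
PROMPT_TEMPLATE = "A video of a {}"             # assumed prompt template

def entity_prompter_scores(crop_feat: torch.Tensor,
                           prompt_feats: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """crop_feat: (dim,) video-crop embedding; prompt_feats: (num_entities, dim)."""
    crop_feat = F.normalize(crop_feat, dim=-1)
    prompt_feats = F.normalize(prompt_feats, dim=-1)
    sims = prompt_feats @ crop_feat / temperature   # similarity of the crop to each entity prompt
    return sims.softmax(dim=-1)                     # soft pseudo-labels used as PEM targets

# Toy usage with random features standing in for the video and text encoders.
dim = 256
crop_feat = torch.randn(dim)
prompt_feats = torch.randn(len(ENTITIES), dim)      # one embedding per instantiated prompt
pseudo_labels = entity_prompter_scores(crop_feat, prompt_feats)
print(dict(zip(ENTITIES, pseudo_labels.tolist())))
```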