Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475637
Dense Contrastive Visual-Linguistic Pretraining

Abstract: Inspired by the success of BERT, several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. In particular, LXMERT and UNITER adopt visual region feature regression and label classification as pretext tasks. However, they tend to suffer from the problems of noisy labels and sparse semantic annotations, based on the visual feature…

Cited by 10 publications (1 citation statement)
References 55 publications
“…Contrastive learning provides neural models with self-supervised competence using relevant and irrelevant pairs. It improves multimodal representations in pretraining handling noise and bias in the data [195].…”
Section: B (mentioning)
confidence: 99%
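The citation statement above describes contrastive learning as supervision from relevant (matched) and irrelevant (mismatched) pairs. As a minimal sketch of that idea, not the paper's actual implementation, an InfoNCE-style loss over a batch of paired image/text embeddings could look like the following (the function name, temperature value, and embedding shapes are illustrative assumptions):

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Contrastive (InfoNCE-style) loss over a batch of embeddings.

    Row i of img_emb and row i of txt_emb form the relevant pair;
    every other pairing in the batch serves as an irrelevant negative.
    Temperature 0.07 is a common illustrative choice, not the paper's.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive (relevant) pair for row i sits on the diagonal
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss pulls matched image/text embeddings together and pushes mismatched ones apart, which is the mechanism the quoted statement credits with improving robustness to noise in multimodal pretraining.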