Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475251

ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

Abstract: Vision-and-language pretraining (VLP) aims to learn generic multimodal representations from massive image-text pairs. While various successful attempts have been proposed, learning fine-grained semantic alignments between image-text pairs plays a key role in their approaches. Nevertheless, most existing VLP approaches have not fully utilized the intrinsic knowledge within the image-text pairs, which limits the effectiveness of the learned alignments and further restricts the performance of their models. To thi…
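The abstract centers on learning fine-grained semantic alignments between image-text pairs. As a rough illustration of what such an alignment objective can look like (this is not ROSITA's actual method, which additionally integrates cross- and intra-modal knowledge), below is a minimal sketch of a symmetric contrastive image-text alignment loss; the class name, feature dimensions, and mean-pooling choice are all illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of an image-text contrastive alignment objective (InfoNCE-style).
# All names and dimensions are assumptions for illustration, not ROSITA's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageTextAligner(nn.Module):
    """Projects image-region and text-token features into a shared space and
    scores image-text pairs with a symmetric contrastive loss."""

    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512, temperature=0.07):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        self.temperature = temperature

    def forward(self, img_regions, txt_tokens):
        # img_regions: (B, R, img_dim) region features; txt_tokens: (B, T, txt_dim)
        img = F.normalize(self.img_proj(img_regions).mean(dim=1), dim=-1)  # (B, D)
        txt = F.normalize(self.txt_proj(txt_tokens).mean(dim=1), dim=-1)   # (B, D)
        logits = img @ txt.t() / self.temperature                          # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)
        # Matched pairs lie on the diagonal; off-diagonal entries act as negatives.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    model = ImageTextAligner()
    loss = model(torch.randn(4, 36, 2048), torch.randn(4, 20, 768))
    print(f"alignment loss: {loss.item():.4f}")
```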


Cited by 35 publications (21 citation statements)
References 51 publications
“…[Ji et al., 2019] adopted a visual saliency detection module to guide the cross-modal correlation. [Cui et al., 2021] integrated intra- and cross-modal knowledge to learn the image and text features jointly.…”
Section: Feature Extraction (mentioning)
confidence: 99%
“…On the one hand, the intra- and cross-modal knowledge in the image and text data is fully exploited in the pre-training ITR approaches [Li et al., 2020c; Cui et al., 2021]. On the other hand, many studies concentrate on increasing the scale of pre-training data.…”
Section: Pre-training Image-Text Retrieval (mentioning)
confidence: 99%
“…The past few years have witnessed the rapid development of Vision-Language Pre-training (VLP) models [2,4,17,39], and task-specific fine-tuning of VLP models has become the new, state-of-the-art paradigm in many multimedia tasks [20,21,33]. Beyond accuracy, fairness, which concerns discrimination against socially protected or sensitive groups, plays a critical role in the trustworthy deployment of VLP models in downstream tasks.…”
Section: Introduction (mentioning)
confidence: 99%
“…It has become increasingly unrealistic to manually watch and process such a tremendous amount of video data. With the growing demand for computers to automatically analyze, understand, and process video content, many video understanding problems [31][32][33] in deep learning and computer vision have arisen and thrived, such as video visual question answering [5,10,11,18,22] and language-guided video action localization [2,34]. Referring video object segmentation aims to selectively segment one specific object spatially and temporally in a video according to a language query.…”
Section: Introduction (mentioning)
confidence: 99%