Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.43
Learning Relation Alignment for Calibrated Cross-modal Retrieval

Abstract: Despite the achievements of large-scale multimodal pre-training approaches, cross-modal retrieval, e.g., image-text retrieval, remains a challenging task. To bridge the semantic gap between the two modalities, previous studies mainly focus on word-region alignment at the object level, lacking the matching between the linguistic relation among the words and the visual relation among the regions. The neglect of such relation consistency impairs the contextualized representation of image-text pairs and hinders th…
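To make the notion of relation consistency concrete, the following is a minimal sketch, not the method the (truncated) abstract introduces: assuming word and region features that are already matched one-to-one, it builds an intra-modal relation matrix for each modality and measures how much the two disagree. The function name, feature shapes, and the KL-based distance are illustrative assumptions.

```python
# Illustrative sketch of "relation consistency" between modalities.
import torch
import torch.nn.functional as F

def relation_consistency(word_feats: torch.Tensor,
                         region_feats: torch.Tensor) -> torch.Tensor:
    """word_feats, region_feats: (n, d) tensors; row i of each is assumed to
    describe the same aligned word/region pair (a simplifying assumption)."""
    w = F.normalize(word_feats, dim=-1)
    r = F.normalize(region_feats, dim=-1)
    # Intra-modal relations: pairwise similarities among words / among regions.
    rel_words = torch.softmax(w @ w.T, dim=-1)     # linguistic relation matrix
    rel_regions = torch.softmax(r @ r.T, dim=-1)   # visual relation matrix
    # Smaller distance = the two modalities agree on how their items relate.
    return F.kl_div(rel_regions.log(), rel_words, reduction="batchmean")

# Toy usage: 5 words and 5 regions with 256-d features.
score = relation_consistency(torch.randn(5, 256), torch.randn(5, 256))
print(float(score))
```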

Cited by 9 publications (2 citation statements)
References 21 publications
“…Video-Language Pre-trained Models. Benefiting from large-scale video-text datasets (Bain et al., 2021; Xue et al., 2021) and advances in Transformer model design (Gorti et al., 2022; Ren et al., 2021; Zellers et al., 2021; Wang et al., 2022a), pre-trained Video-Language Models (VidLMs) (Chen et al., 2022) have demonstrated impressive performance in video-language understanding tasks. VidLMs typically comprise a video encoder and a text encoder, which encode video-text pairs into a shared feature space to learn the semantic alignment between video and language.…”
Section: Related Work (classified as mentioning, confidence: 99%)
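The dual-encoder recipe described in the excerpt above can be summarized in a few lines. The sketch below is a generic illustration rather than any specific VidLM: the encoder outputs, dimensionalities, and temperature are assumptions, and the symmetric InfoNCE-style loss stands in for whatever alignment objective a given model actually uses.

```python
# Minimal sketch of dual-encoder video-text alignment in a shared space.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, d) pooled features from the two encoders."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature              # similarity of every pair
    targets = torch.arange(v.size(0), device=v.device)
    # Matched video-text pairs sit on the diagonal; pull them together and
    # push mismatched pairs apart, in both retrieval directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random "encoder outputs" for a batch of 8 pairs.
video_emb = torch.randn(8, 512, requires_grad=True)
text_emb = torch.randn(8, 512, requires_grad=True)
loss = contrastive_alignment_loss(video_emb, text_emb)
loss.backward()  # in a real training loop this would update both encoders
```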
“…(2)-(4)) on video-text pairs. Considering TESTA can aggregate tokens into objects, scenes, events, etc., training with fine-grained alignment functions (Ren et al., 2021; Wang et al., 2022c) could help some tasks like action localization and video object detection (Zhukov et al., 2019; Real et al., 2017), on which we will perform more explorations in future work.…”
Section: Limitations (classified as mentioning, confidence: 99%)
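For context on what a "fine-grained alignment function" over aggregated tokens might look like, here is a hedged sketch: each text token is scored against its best-matching visual token (which, in TESTA's case, could stand for an object, scene, or event), and the per-token scores are averaged into a pair-level score. The function name, pooling choices, and shapes are illustrative assumptions, not the cited papers' exact formulations.

```python
# Sketch of token-level (fine-grained) video-text alignment scoring.
import torch
import torch.nn.functional as F

def fine_grained_score(text_tokens: torch.Tensor,
                       visual_tokens: torch.Tensor) -> torch.Tensor:
    """text_tokens: (n_text, d); visual_tokens: (n_visual, d), e.g. aggregated
    tokens standing in for objects, scenes, or events."""
    t = F.normalize(text_tokens, dim=-1)
    v = F.normalize(visual_tokens, dim=-1)
    sim = t @ v.T                    # (n_text, n_visual) token-level similarities
    # Credit each text token with its best-matching visual token, then average.
    return sim.max(dim=-1).values.mean()

print(float(fine_grained_score(torch.randn(12, 256), torch.randn(20, 256))))
```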