2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv56688.2023.00108
Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval

Abstract: Image-sentence retrieval has attracted extensive research attention in multimedia and computer vision due to its promising applications. The key issue lies in jointly learning visual and textual representations to accurately estimate their similarity. To this end, the mainstream schema adopts object-word based attention to calculate relevance scores and refine the interactive representations with the attention features, which, however, neglects the context of the object representation on the inter…
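
The object-word attention schema the abstract refers to can be sketched roughly as below. This is a minimal illustration, not the paper's actual model: the temperature `lam`, the softmax over objects, and the mean-pooled cosine score are assumptions borrowed from typical stacked cross-attention designs for image-sentence retrieval.

```python
# Minimal sketch of object-word cross-attention for image-sentence
# similarity. NOT the paper's method; shapes, `lam`, and the pooled
# cosine score are generic assumptions.
import torch
import torch.nn.functional as F

def object_word_attention(objects, words, lam=9.0):
    """objects: (n_obj, d) region features; words: (n_word, d) token features.
    Returns a scalar image-sentence similarity."""
    obj = F.normalize(objects, dim=-1)
    wrd = F.normalize(words, dim=-1)
    # Relevance scores between every object and every word.
    scores = obj @ wrd.t()                 # (n_obj, n_word)
    attn = F.softmax(lam * scores, dim=0)  # weight objects per word
    # Attention feature: an attended visual vector for each word.
    attended = attn.t() @ obj              # (n_word, d)
    # Compare each word with its attended visual context, then pool.
    sim = F.cosine_similarity(attended, wrd, dim=-1)  # (n_word,)
    return sim.mean()

# Toy usage with random features: 36 regions, 12 words, d = 256.
print(float(object_word_attention(torch.randn(36, 256), torch.randn(12, 256))))
```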

Cited by 25 publications (2 citation statements) · References 45 publications
“…A multi-head self-attention mechanism is then employed to effectively capture spatial context information within the image, resulting in features being extracted for these individual patches. To ensure semantic alignment between visual and textual data, we follow a similar strategy as in prior research [28,29]; the patch features are projected onto a unified semantic space through fully connected layers. As a result, the extracted features for each patch may be denoted as V = {v_i | i = 1, ..., m}, v_i ∈ ℝ^d, where m denotes the number of patches per image.…”
Section: Feature Extraction Component
confidence: 99%
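
The quoted feature-extraction step can be illustrated with a short sketch: multi-head self-attention over patch embeddings, followed by a fully connected projection into the shared d-dimensional semantic space, yielding V = {v_i | i = 1, ..., m}. The layer sizes and the single-layer projection are assumptions; the cited works [28,29] are not reproduced here.

```python
# Rough sketch of the quoted pipeline, under assumed dimensions.
import torch
import torch.nn as nn

class PatchProjector(nn.Module):
    def __init__(self, patch_dim=768, d=256, heads=8):
        super().__init__()
        # Multi-head self-attention captures spatial context across patches.
        self.attn = nn.MultiheadAttention(patch_dim, heads, batch_first=True)
        # Fully connected layer aligns patches to the shared semantic space.
        self.fc = nn.Linear(patch_dim, d)

    def forward(self, patches):  # patches: (batch, m, patch_dim)
        ctx, _ = self.attn(patches, patches, patches)
        return self.fc(ctx)      # V: (batch, m, d)

# Toy usage: m = 49 patches per image.
v = PatchProjector()(torch.randn(2, 49, 768))
print(v.shape)  # torch.Size([2, 49, 256])
```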
“…Scene graphs and multimodality. The structural representations of scene graphs have been explored in the context of different V&L tasks, such as image-text retrieval (Johnson et al., 2015; Schuster et al., 2015; Schroeder and Tripathi, 2020; Ge et al., 2023), image captioning (Yao et al., 2018; Yang et al., 2019), and visual QA (Qian et al., 2022; Koner et al., 2021; Shi et al., 2019).…”
Section: Related Work
confidence: 99%