2021
DOI: 10.48550/arxiv.2106.02400
Preprint
A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval

Abstract: Conventional approaches to image-text retrieval mainly focus on indexing the visual objects appearing in pictures but ignore the interactions between these objects. Such object occurrences and interactions are equally useful and important in this field, as they are usually mentioned in the text. Scene-graph representation is a suitable method for the image-text matching challenge and has obtained good results due to its ability to capture inter-relationship information. Both images and text are represented in s…
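The matching idea the abstract describes — embed each modality's scene graph and compare the resulting vectors — can be sketched minimally as follows. The node vectors, mean-pooling choice, and dimensions here are invented for illustration and are not the paper's actual model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pool(node_vectors):
    """Collapse per-node embeddings into one global graph embedding."""
    dim = len(node_vectors[0])
    return [sum(v[i] for v in node_vectors) / len(node_vectors) for i in range(dim)]

# Hypothetical toy scene graphs: one embedding per object/relation node.
image_graph_nodes = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]    # e.g. "dog", "on-grass"
text_graph_nodes = [[0.85, 0.15, 0.05], [0.7, 0.3, 0.1]]  # e.g. "dog", "lying on lawn"

img_emb = mean_pool(image_graph_nodes)
txt_emb = mean_pool(text_graph_nodes)
score = cosine_similarity(img_emb, txt_emb)
```

A high `score` would rank this caption near the top for this image; a real system learns the node embeddings jointly so that matching pairs score high.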

Cited by 6 publications (13 citation statements)
References 19 publications
“…Regarding graph structures, SGM [35] introduced a visual graph encoder and a textual graph encoder to capture the interactions between the objects appearing in images and between the entities in text. LGSGM [26] proposed a graph embedding network on top of SGM to learn both local and global information about the graphs. Similarly, GSMN [21] presented a novel technique to assess the correspondence of the nodes and edges of graphs extracted from images and texts separately.…”
Section: Related Work
confidence: 99%
“…A graph neural network was employed to extract visual and textual embedded vectors from fused graph-based structures of images and texts, where we can measure their cosine similarity. To the best of our knowledge, the graph structure has been widely applied in the image-text retrieval challenge [26,7,27,35,21]. Nevertheless, it was utilized to capture the interaction between objects or align local and global information within images.…”
Section: Introduction
confidence: 99%
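The graph-neural-network step this citing paper describes — propagating information over the fused graph before pooling — can be illustrated with one toy message-passing round. The adjacency list, features, and simple averaging rule below are invented for illustration, not the actual network:

```python
# Hypothetical toy graph: adjacency list and 2-d initial node features.
adj = {0: [1, 2], 1: [0], 2: [0]}
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.5, 0.5]}

def message_pass(adj, feats):
    """One propagation step: each node averages its own and its
    neighbors' feature vectors (a deliberately simple update rule)."""
    new = {}
    for node, nbrs in adj.items():
        group = [feats[node]] + [feats[n] for n in nbrs]
        dim = len(feats[node])
        new[node] = [sum(v[i] for v in group) / len(group) for i in range(dim)]
    return new

updated = message_pass(adj, feats)
```

After a few such rounds each node's vector reflects its neighborhood, so pooling the nodes yields a graph embedding that encodes object interactions, not just object occurrences.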
“…This was also the approach of Exquisitor [16] in LSC'21. Due to an increase in the performance of embedding models in the image retrieval field [19,23,26], many lifelog retrieval systems are now applying this approach [2,3,20,37]. Memento [2] and Voxento [3] were two of the teams that achieved high performance in LSC'21.…”
Section: Related Work
confidence: 99%