A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval

Nguyen, Manh-Duy; Nguyen, Binh T.; Gurrin, Cathal

doi:10.48550/arxiv.2106.02400

Cited by 6 publications

(13 citation statements)

References 19 publications

“…Regarding to graph structures, SGM [35] introduced a visual graph encoder and a textual graph encoder to capture the interaction between objects appearing in images and between the entities in text. LGSGM [26] proposed a graph embedding network on top of SGM to learn both local and global information about the graphs. Similarly, GSMN [21] presented a novel technique to assess the correspondence of nodes and edges of graphs extracted from images and texts separately.…”

Section: Related Work (mentioning)

confidence: 99%

“…A graph neural network was employed to extract visual and textual embedded vectors from fused graph-based structures of images and texts, where we can measure their cosine similarity. To the best of our knowledge, the graph structure has been widely applied in the image-text retrieval challenge [26,7,27,35,21]. Nevertheless, it was utilized to capture the interaction between objects or align local and global information within images.…”

Section: Introduction (mentioning)

confidence: 99%

HADA: A Graph-based Amalgamation Framework in Image-text Retrieval

Nguyen, Nguyen, Gurrin

2023

Preprint

Many models have been proposed for vision and language tasks, especially the image-text retrieval task. All state-of-the-art (SOTA) models in this challenge contained hundreds of millions of parameters. They also were pretrained on a large external dataset that has been proven to make a big improvement in overall performance. It is not easy to propose a new model with a novel architecture and intensively train it on a massive dataset with many GPUs to surpass many SOTA models, which are already available to use on the Internet. In this paper, we proposed a compact graph-based framework, named HADA, which can combine pretrained models to produce a better result, rather than building from scratch. First, we created a graph structure in which the nodes were the features extracted from the pretrained models and the edges connecting them. The graph structure was employed to capture and fuse the information from every pretrained model with each other. Then a graph neural network was applied to update the connection between the nodes to get the representative embedding vector for an image and text. Finally, we used the cosine similarity to match images with their relevant texts and vice versa to ensure a low inference time. Our experiments showed that, although HADA contained a tiny number of trainable parameters, it could increase baseline performance by more than 3.6% in terms of evaluation metrics in the Flickr30k dataset. Additionally, the proposed model did not train on any external dataset and did not require many GPUs but only 1 to train due to its small number of parameters. The source code is available at https://github.com/m2man/HADA.

“…This was also the approach of Exquisitor [16] in LSC'21. Due to an increase in the performance of embedding models in the image retrieval field [19,23,26], many lifelog retrieval systems are now applying this approach [2,3,20,37]. Memento [2] and Voxento [3] were two of the teams that achieved high performance in LSC'21.…”

Section: Related Work (mentioning)

confidence: 99%

E-Myscéal: Embedding-based Interactive Lifelog Retrieval System for LSC'22

Tran

Nguyen

et al. 2022

Proceedings of the 5th Annual on Lifelog Search Challenge

Self Cite

Developing interactive lifelog retrieval systems is a growing research area. There are many international competitions for lifelog retrieval that encourage researchers to build effective systems that can address the multimodal retrieval challenge of lifelogs. The Lifelog Search Challenge (LSC) was first organised in 2018 and is currently the only interactive benchmarking evaluation for lifelog retrieval systems. Participating systems should have an accurate search engine and a user-friendly interface that can help users to retrieve relevant content. In this paper, we upgrade our previous Myscéal, which was the top performing system in LSC'20 and LSC'21, and present E-Myscéal for LSC'22, which includes a completely different search engine. Instead of using visual concepts for retrieval such as Myscéal, the new E-Myscéal employs an embedding technique that facilitates novice users who are not familiar with the concepts. Our experiments show that the new search engine can find relevant images in the first place in the ranked list for a quarter of the LSC'21 queries (26%) by using just the first hint from the textual information need. Regarding the user interface, we still keep the simple non-faceted design as in the previous version but improve the event view browsing in order to better support novice users.

CCS Concepts: Information systems → Information retrieval; Human-centered computing → Human computer interaction (HCI); User interface design.

“…A graph neural network was employed to extract visual and textual embedded vectors from fused graph-based structures of images and texts, where we can measure their cosine similarity. To the best of our knowledge, the graph structure has been widely applied in the image-text retrieval challenge [7,21,26,27,35]. Nevertheless, it was utilized to capture the interaction between objects or align local and global information within images.…”

Section: Introduction (mentioning)

confidence: 99%

HADA: A Graph-Based Amalgamation Framework in Image-text Retrieval

Nguyen

Gurrin

2023

Lecture Notes in Computer Science

Many models have been proposed for vision and language tasks, especially the image-text retrieval task. State-of-the-art (SOTA) models in this challenge contain hundreds of millions of parameters. They also were pretrained on large external datasets that have been proven to significantly improve overall performance. However, it is not easy to propose a new model with a novel architecture and intensively train it on a massive dataset with many GPUs to surpass many SOTA models already available to use on the Internet. In this paper, we propose a compact graph-based framework named HADA, which can combine pretrained models to produce a better result rather than starting from scratch. Firstly, we created a graph structure in which the nodes were the features extracted from the pretrained models and the edges connecting them. The graph structure was employed to capture and fuse the information from every pretrained model. Then a graph neural network was applied to update the connection between the nodes to get the representative embedding vector for an image and text. Finally, we employed cosine similarity to match images with their relevant texts and vice versa to ensure a low inference time. Our experiments show that, although HADA contained a tiny number of trainable parameters, it could increase baseline performance by more than 3.6% in terms of evaluation metrics on the Flickr30k dataset. Additionally, the proposed model did not train on any external dataset and only required a single GPU to train due to the small number of parameters required. The source code is available at https://github.com/m2man/HADA.
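The final matching step described in the abstract, ranking candidates by cosine similarity between embedding vectors, can be illustrated with a minimal sketch. This is not the HADA implementation: the function names and the toy three-dimensional embeddings below are hypothetical, and the graph neural network that would produce the embeddings is omitted entirely.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical, made-up embeddings purely for illustration.
image_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
text_embedding = [0.9, 0.1, 0.0]

# Text-to-image retrieval: rank every image against one text query.
ranking = rank_by_similarity(text_embedding, image_embeddings)
print(ranking)
```

Because each query needs only one dot product per candidate, retrieval of this kind stays cheap at inference time, which is consistent with the low inference time the abstract mentions.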
