Cross-Modal Hybrid Feature Fusion for Image-Sentence Matching

Xu, Xing; Wang, Yifan; He, Yixuan; Yang, Yang; Hanjalic, Alan; Shen, Heng Tao

doi:10.1145/3458281

Cited by 38 publications

(6 citation statements)

References 57 publications


“…The image-text cross-modal retrieval task is designed to explore the correspondence between image and text. The existing matching methods can be roughly divided into two categories: graph-free paradigm [6, 7, 10–14, 18–27] and graph-based paradigm [8, 9, 15–17, 29–33].…”

Section: Image-text Cross-modal Retrieval (mentioning)

confidence: 99%

“…While these methods show promising results in image-text cross-modal retrieval, they mainly embed global feature representations and overlook the fine-grained semantic associations between image and text. To tackle this limitation, recent research concentrates on learning correspondences between image regions and text words, achieving semantic coverage from coarse-to-fine [11–14]. For instance, Xu et al. [11] propose a cross-modal hybrid feature fusion method to capture interactions between image and text, which learns image-text similarity by fusing feature representation of intra- and inter-modality, providing robust semantic interactions between image regions and text words.…”

Section: Image-text Cross-modal Retrieval (mentioning)

confidence: 99%

“…To tackle this limitation, recent research concentrates on learning correspondences between image regions and text words, achieving semantic coverage from coarse-to-fine [11–14]. For instance, Xu et al. [11] propose a cross-modal hybrid feature fusion method to capture interactions between image and text, which learns image-text similarity by fusing feature representation of intra- and inter-modality, providing robust semantic interactions between image regions and text words. Another work by Lan et al. [13] proposes a multi-level matching network model that incorporates multi-level similarity between image and text via adaptive matching integration strategies.…”

Section: Image-text Cross-modal Retrieval (mentioning)

confidence: 99%

“…However, these methods only capture the rough semantic correlation between different modalities and fail to describe the local semantic correspondence between image regions and text words effectively. To address this limitation, fine-grained cross-modal retrieval methods [3, 8, 9, 11–17] have been proposed for modelling the local similarity between image regions and text words. Currently, fine-grained image-text cross-modal retrieval methods can be roughly divided into two categories: (1) Graph-free paradigm [6, 7, 10–14, 18–27]: These methods typically encode multi-level feature representations using the output of the last layer of the encoder and then fuse these multi-level similarities to obtain the final cross-modal similarity.…”

Section: Introduction (mentioning)

confidence: 99%

“…To address this limitation, fine-grained cross-modal retrieval methods [3, 8, 9, 11–17] have been proposed for modelling the local similarity between image regions and text words. Currently, fine-grained image-text cross-modal retrieval methods can be roughly divided into two categories: (1) Graph-free paradigm [6, 7, 10–14, 18–27]: These methods typically encode multi-level feature representations using the output of the last layer of the encoder and then fuse these multi-level similarities to obtain the final cross-modal similarity. Additionally, region-level features from visual target detectors (e.g., Faster R-CNN [28]) are employed to establish semantic alignment between image regions and text words.…”

Section: Introduction (mentioning)

confidence: 99%


Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding

Zeng, Ma, et al. 2024. Electronics

Image–text cross-modal retrieval aims to bridge the semantic gap between different modalities, allowing for the search of images based on textual descriptions or vice versa. Existing efforts in this field concentrate on coarse-grained feature representation and then utilize pairwise ranking loss to pull image–text positive pairs closer, pushing negative ones apart. However, using pairwise ranking loss directly on coarse-grained representation lacks reliability as it disregards fine-grained information, posing a challenge in narrowing the semantic gap between image and text. To this end, we propose an Instance Contrastive Embedding (IConE) method for image–text cross-modal retrieval. Specifically, we first transfer the multi-modal pre-training model to the cross-modal retrieval task to leverage the interactive information between image and text, thereby enhancing the model’s representational capabilities. Then, to comprehensively consider the feature distribution of intra- and inter-modality, we design a novel two-stage training strategy that combines instance loss and contrastive loss, dedicated to extracting fine-grained representation within instances and bridging the semantic gap between modalities. Extensive experiments on two public benchmark datasets, Flickr30k and MS-COCO, demonstrate that our IConE outperforms several state-of-the-art (SoTA) baseline methods and achieves competitive performance.
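The pairwise ranking loss mentioned in the abstract above can be illustrated with a minimal sketch. This is a generic hinge-based bidirectional ranking loss, not the IConE implementation; the function names, the cosine similarity choice, and the margin value of 0.2 are assumptions made for illustration only.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def pairwise_ranking_loss(image_embs, text_embs, margin=0.2):
    """Hinge-based bidirectional ranking loss (illustrative, hypothetical helper).

    image_embs[i] and text_embs[i] are assumed to be a matched (positive) pair;
    every other combination in the batch is treated as a negative.
    """
    n = len(image_embs)
    sims = [[cosine(image_embs[i], text_embs[j]) for j in range(n)]
            for i in range(n)]
    loss = 0.0
    for i in range(n):
        pos = sims[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i should score higher with its own text than with text j.
            loss += max(0.0, margin - pos + sims[i][j])
            # Text i should score higher with its own image than with image j.
            loss += max(0.0, margin - pos + sims[j][i])
    return loss / n


# Correctly aligned pairs incur no loss; swapped pairs are penalized.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(pairwise_ranking_loss(images, aligned_texts))  # 0.0
print(pairwise_ranking_loss(images, swapped_texts) > 0.0)  # True
```

The loss is zero once every positive pair beats every negative by at least the margin, which is the "pull positives closer, push negatives apart" behavior the abstract describes.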


Predictive Information Preservation via Variational Information Bottleneck for Cross-View Geo-Localization

Li, Hu. 2022. Communications in Computer and Information Science

Deep Convolutional Neural Network Compression Method: Tensor Ring Decomposition with Variational Bayesian Approach

Liu, Zhang, Shi, et al. 2024. Neural Process Lett

Because deep neural networks (DNNs) have a large number of parameters, they increase the demand for computing and storage during training, inference, and deployment, especially as they are stacked deeper and wider. Tensor decomposition can compress DNN models, reducing parameters and storage requirements while maintaining high accuracy and performance. Tensor ring (TR) decomposition, one form of tensor decomposition, has two problems: (1) setting all TR ranks equal results in an unreasonable rank configuration; (2) selecting ranks through iterative processes is time-consuming. To address these two problems, a TR network compression method based on Variational Bayes (TR-VB) is proposed, built on the Global Analytic Solution of Empirical Variational Bayesian Matrix Factorization (GAS of EVBMF). The method consists of three steps: (1) rank selection, (2) TR decomposition, and (3) fine-tuning to recover the accumulated loss of accuracy. Experimental results show that, for a given network, TR-VB gives the best results in terms of Top-1 accuracy, parameters, and training time under different compression levels. Furthermore, TR-VB, validated on the CIFAR-10/100 public benchmarks, achieves state-of-the-art performance.
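The effect of the rank configuration discussed in the abstract above can be made concrete with a parameter count. In the standard tensor ring format, core k has shape (r_k, n_k, r_{k+1}) with the last rank wrapping around to the first. The sketch below only counts parameters; the mode sizes and rank choices are hypothetical and hand-picked, and it does not implement the TR-VB rank selection.

```python
def tr_param_count(mode_sizes, ranks):
    """Number of parameters in a tensor ring with the given mode sizes and ranks.

    Core k has shape (ranks[k], mode_sizes[k], ranks[k + 1]), and the ring
    closes so that the rank after the last core equals ranks[0].
    """
    d = len(mode_sizes)
    if len(ranks) != d:
        raise ValueError("need exactly one rank per mode")
    return sum(ranks[k] * mode_sizes[k] * ranks[(k + 1) % d] for k in range(d))


def dense_param_count(mode_sizes):
    """Number of entries in the uncompressed tensor."""
    total = 1
    for n in mode_sizes:
        total *= n
    return total


# Hypothetical 4-way reshaping of a weight tensor with 1024 entries.
modes = [4, 8, 8, 4]
print(dense_param_count(modes))             # 1024
print(tr_param_count(modes, [4, 4, 4, 4]))  # 384 (all ranks equal)
print(tr_param_count(modes, [2, 4, 4, 2]))  # 240 (ranks vary per mode)
```

Allowing ranks to differ per mode changes the parameter budget for the same tensor, which is why a rank selection step precedes the decomposition in the method described.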
