Proceedings of the 27th ACM International Conference on Multimedia 2019
DOI: 10.1145/3343031.3350875

Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking

Abstract: A major challenge in matching images and text is that they have intrinsically different data distributions and feature representations. Most existing approaches are based either on embedding or classification, the first one mapping image and text instances into a common embedding space for distance measuring, and the second one regarding image-text matching as a binary classification problem. Neither of these approaches can, however, balance the matching accuracy and model complexity well. We propose a novel f…
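The two paradigms the abstract contrasts can be sketched in a few lines. This is a toy illustration under assumed feature dimensions and random weights, not the paper's actual architecture: an embedding model projects both modalities into one space and ranks by similarity, while a classification model fuses a pair's features and scores "match vs. non-match".

```python
import numpy as np

rng = np.random.default_rng(0)
img_feat = rng.standard_normal(2048)   # e.g., a CNN image feature (assumed size)
txt_feat = rng.standard_normal(300)    # e.g., a sentence embedding (assumed size)

# Embedding approach: project both modalities into a shared space,
# then rank candidates by a distance/similarity measure.
W_img = rng.standard_normal((256, 2048)) * 0.01   # illustrative projections
W_txt = rng.standard_normal((256, 300)) * 0.01
v = W_img @ img_feat
t = W_txt @ txt_feat
cosine_sim = v @ t / (np.linalg.norm(v) * np.linalg.norm(t))

# Classification approach: fuse the pair's features and score
# "match vs. non-match" with a binary classifier head.
fused = np.concatenate([v, t, v * t])                 # simple fusion, illustrative
w_clf = rng.standard_normal(fused.shape[0]) * 0.01
match_prob = 1.0 / (1.0 + np.exp(-(w_clf @ fused)))  # sigmoid match score

print(f"cosine similarity: {cosine_sim:.3f}, match probability: {match_prob:.3f}")
```

The trade-off the abstract refers to follows from this shape: the embedding route is cheap at retrieval time (precompute embeddings, compare), while the classification route must run the fusion head for every query-candidate pair.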


Cited by 125 publications (56 citation statements)
References 38 publications
“…Several recent methods for image-caption retrieval employ an object detector and cross-modal attention. For example, SCO [16], SCAN [11], and MTFN [49] crop 10, 24, and 36 regions, respectively, and merge them nonlinearly using a cross-modal attention network as introduced in Sect. 2.3.…”
Section: Comparison With State-of-the-art Models (mentioning)
confidence: 99%
“…† Note that MTFN [49] also proposed a re-ranking algorithm, which finds the best match between a set of queries and a set of targets (not between a single query and a set of targets). We omit this result because the problem setting is completely different from those of the other studies.…”
(mentioning)
confidence: 99%
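The set-to-set matching setting described above can be illustrated with a greedy joint assignment over a similarity matrix. This is a hedged sketch of the *problem setting*, not MTFN's actual re-ranking algorithm; the similarity matrix here is random toy data.

```python
import numpy as np

rng = np.random.default_rng(2)
sim = rng.random((3, 3))  # sim[i, j]: similarity of query i to target j

# Independent ranking: each query takes its own best target (may collide).
independent = sim.argmax(axis=1)

# Joint greedy matching: repeatedly take the globally best remaining
# query-target pair, so every target is assigned to at most one query.
assignment = {}
remaining = sim.copy()
for _ in range(sim.shape[0]):
    i, j = np.unravel_index(np.argmax(remaining), remaining.shape)
    assignment[i] = j
    remaining[i, :] = -np.inf   # query i is used up
    remaining[:, j] = -np.inf   # target j is used up

print(independent, assignment)
```

Unlike the independent ranking, the joint assignment produces a one-to-one matching, which is why results from this setting are not directly comparable to per-query retrieval scores.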
“…Joint Embedding: Joint embedding models have shown excellent performance on several multimedia tasks, e.g., cross-modal retrieval [10,19,31,38,53,54,75,77,82,84], image captioning [35,49], image classification [23,25,32], video summarization [15,58], and cross-view matching [81]. Cross-modal retrieval methods require computing similarity between two different modalities, e.g., RGB and depth.…”
Section: Related Work (mentioning)
confidence: 99%
“…Tensors are a widely used data representation for interaction data in the Machine Learning (ML) application community [1], e.g., in Recommendation Systems [2], Quality of Service (QoS) [3], Network Flow [4], Cyber-Physical-Social (CPS) [5], or Social Networks [6]. In addition to applications in which the data is naturally represented in the form of tensors, another common use case is fusion in multi-view or multi-modality problems [7]. Here, during the learning process, each modality corresponds to a feature and the feature alignment involves fusion.…”
(mentioning)
confidence: 99%
“…Here, during the learning process, each modality corresponds to a feature and the feature alignment involves fusion. Tensors are a common form of feature fusion for multi-modal learning [7], [8], [9], [10]. Unfortunately, tensors can be difficult to process in practice.…”
(mentioning)
confidence: 99%
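A minimal example of tensor-based feature fusion for two modalities is an outer product of the modality features. This sketch assumes tiny illustrative dimensions and uses the common bias-augmentation trick (appending a constant 1) so the fusion tensor retains unimodal terms alongside pairwise interactions; it is an illustration of the idea the citation describes, not any specific paper's model.

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.standard_normal(4)   # toy image feature
txt = rng.standard_normal(3)   # toy text feature

# Append a constant 1 so the tensor also keeps each modality's own terms.
img_a = np.concatenate([img, [1.0]])
txt_a = np.concatenate([txt, [1.0]])

# Rank-1 fusion tensor: every pairwise interaction between the modalities.
fusion = np.outer(img_a, txt_a)   # shape (5, 4)

# In practice the (flattened) tensor feeds a learned projection; its size
# grows multiplicatively with the modality dimensions, which is one reason
# tensors are "difficult to process in practice".
flat = fusion.reshape(-1)         # 20 interaction features
print(flat.shape)
```

With realistic feature sizes (thousands of dimensions per modality) the outer product becomes enormous, which motivates the low-rank and decomposition techniques this line of work studies.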