Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking

Wang, Tan; Xu, Xing; Yang, Yang; Hanjalic, Alan; Shen, Heng Tao; Song, Jingkuan

doi:10.1145/3343031.3350875

Cited by 125 publications

(56 citation statements)

References 38 publications

“…Several recent methods for image-caption retrieval employ an object detector and a cross-modal attention. For example, SCO [16], SCAN [11], and MTFN [49] crop 10, 24, 36 regions, respectively, and merge them nonlinearly by using a cross-modal attention network as introduced in Sect. 2.3.…”

Section: Comparison With State-of-the-art Models

mentioning

confidence: 99%

Target-Oriented Deformation of Visual-Semantic Embedding Space

Matsubara

2021

IEICE Trans. Inf. & Syst.

“…† Note that MTFN [49] also proposed a re-ranking algorithm, which finds the best match between a set of queries and a set of targets (not between a single query and a set of targets). We omit this result because the problem setting is completely different from those of the other studies.…”

mentioning

confidence: 99%

Target-Oriented Deformation of Visual-Semantic Embedding Space

Matsubara

2021

IEICE Trans. Inf. & Syst.

“…Joint Embedding: Joint embedding models have shown excellent performance on several multimedia tasks, e.g., cross-modal retrieval [10,19,31,38,53,54,75,77,82,84], image captioning [35,49], image classification [23,25,32], video summarization [15,58], and cross-view matching [81]. Cross-modal retrieval methods require computing similarity between two different modalities, e.g., RGB and depth.…”

Section: Related Work

mentioning

confidence: 99%

RGB2LIDAR

Mithun, Sikka, Chiu, et al.

2020

Proceedings of the 28th ACM International Conference on Multimedia

We study an important, yet largely unexplored problem of large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km² area) of RGB and aerial LIDAR depth images. We propose a novel joint-embedding-based method that effectively combines the appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a strong result of a median rank of 5 in matching across a large test set of 50K location pairs collected from a 14 km² area. This represents a significant advancement over prior works in performance and scale. We conclude with qualitative results to highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization.

CCS Concepts: Information systems → Multimedia and multimodal retrieval; Computing methodologies → Matching.
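The median-rank figure quoted in this abstract is a standard retrieval metric, and a minimal sketch may make it concrete. Everything here is illustrative rather than taken from the paper: the function names `match_ranks` and `median_rank`, the assumption that the true match of query `q` is target `q`, and the toy similarity scores are all made up for the example.

```python
from statistics import median


def match_ranks(similarity):
    """Return the rank (1 = best) of the true match for each query.

    similarity[q][t] is the score between query q and target t.
    Assumption for this sketch: the true match of query q is target q.
    """
    ranks = []
    for q, row in enumerate(similarity):
        true_score = row[q]
        # Count the targets that score strictly higher than the true match.
        better = sum(1 for score in row if score > true_score)
        ranks.append(better + 1)
    return ranks


def median_rank(similarity):
    """Median of the per-query ranks; lower is better."""
    return median(match_ranks(similarity))


# Toy 4x4 similarity matrix with made-up scores.
toy = [
    [0.9, 0.1, 0.3, 0.2],  # true match (0.9) is ranked 1st
    [0.8, 0.5, 0.6, 0.1],  # true match (0.5) is ranked 3rd
    [0.2, 0.7, 0.6, 0.1],  # true match (0.6) is ranked 2nd
    [0.1, 0.2, 0.3, 0.9],  # true match (0.9) is ranked 1st
]

print(match_ranks(toy))  # [1, 3, 2, 1]
print(median_rank(toy))  # 1.5
```

A median rank of 5 therefore means that, for half of the queries, the correct target appears among the top five retrieved candidates.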

“…Tensors are a widely used data representation style for interaction data in the Machine Learning (ML) application community [1], e.g., in Recommendation Systems [2], Quality of Service (QoS) [3], Network Flow [4], Cyber-Physical-Social (CPS) [5], or Social Networks [6]. In addition to applications in which the data is naturally represented in the form of tensors, another common use case is the fusion in multi-view or multi-modality problems [7]. Here, during the learning process, each modality corresponds to a feature and the feature alignment involves fusion.…”

mentioning

confidence: 99%

“…Here, during the learning process, each modality corresponds to a feature and the feature alignment involves fusion. Tensors are a common form of feature fusion for multi-modal learning [7], [8], [9], [10]. Unfortunately, tensors can be difficult to process in practice.…”

mentioning

confidence: 99%

SGD_Tucker: A Novel Stochastic Optimization Strategy for Scalable Parallel Sparse Tucker Decomposition

Rellermeyer et al.

2021

IEEE Trans. Parallel Distrib. Syst.

Sparse Tucker Decomposition (STD) algorithms learn a core tensor and a group of factor matrices to obtain an optimal low-rank representation feature for the High-Order, High-Dimension, and Sparse Tensor (HOHDST). However, existing STD algorithms suffer from an explosion of intermediate variables, because forming those variables (the matrix Khatri-Rao product, the Kronecker product, and matrix-matrix multiplication) involves all elements of the sparse tensor. These problems prevent deep fusion of efficient computation and big data platforms. To overcome the bottleneck, a novel stochastic optimization strategy (SGD_Tucker) is proposed for STD which can automatically divide the high-dimension intermediate variables into small batches of intermediate matrices. Specifically, SGD_Tucker uses only randomly selected small samples rather than all elements, while maintaining the overall accuracy and convergence rate. In practice, SGD_Tucker features two distinct advancements over the state of the art. First, SGD_Tucker can prune the communication overhead for the core tensor in distributed settings. Second, the low data dependence of SGD_Tucker enables fine-grained parallelization, which lets SGD_Tucker achieve lower computational overhead with the same accuracy. Experimental results show that SGD_Tucker runs at least 2X faster than the state of the art.
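The idea of updating a Tucker model from small random samples of nonzero entries, rather than from the whole tensor, can be sketched as follows. This is not the SGD_Tucker algorithm from the paper: it is plain per-entry stochastic gradient descent on a 3-way tensor with a small dense core, without the batching of intermediate matrices, communication pruning, or parallelization the abstract describes. The function names, tensor sizes, learning rate, and synthetic data are all assumptions made for illustration.

```python
import numpy as np


def predict(core, factors, idx):
    """Value of one entry (i, j, k) under a 3-way Tucker model."""
    i, j, k = idx
    return float(np.einsum("pqr,p,q,r->", core,
                           factors[0][i], factors[1][j], factors[2][k]))


def sgd_step(core, factors, idx, value, lr=0.05):
    """Update core and factor rows in place from a single observed entry."""
    i, j, k = idx
    # Copy the rows so every gradient uses the pre-update parameters.
    a = factors[0][i].copy()
    b = factors[1][j].copy()
    c = factors[2][k].copy()
    err = np.einsum("pqr,p,q,r->", core, a, b, c) - value
    # Gradients of 0.5 * err**2 with respect to each parameter block.
    grad_a = err * np.einsum("pqr,q,r->p", core, b, c)
    grad_b = err * np.einsum("pqr,p,r->q", core, a, c)
    grad_c = err * np.einsum("pqr,p,q->r", core, a, b)
    grad_core = err * np.einsum("p,q,r->pqr", a, b, c)
    factors[0][i] -= lr * grad_a
    factors[1][j] -= lr * grad_b
    factors[2][k] -= lr * grad_c
    core -= lr * grad_core


def rmse(core, factors, entries):
    """Root-mean-square error over a list of ((i, j, k), value) pairs."""
    sq = [(predict(core, factors, idx) - v) ** 2 for idx, v in entries]
    return float(np.sqrt(np.mean(sq)))


def fit(entries, shape, ranks, epochs=50, batch=32, lr=0.05, seed=0):
    """Fit a Tucker model by visiting random mini-batches of entries."""
    rng = np.random.default_rng(seed)
    core = rng.uniform(0.2, 0.8, size=ranks)
    factors = [rng.uniform(0.2, 0.8, size=(n, r))
               for n, r in zip(shape, ranks)]
    for _ in range(epochs):
        # Touch only a small random batch of nonzeros, never the full tensor.
        for n in rng.choice(len(entries), size=batch, replace=False):
            idx, value = entries[n]
            sgd_step(core, factors, idx, value, lr)
    return core, factors


# Synthetic sparse observations drawn from a hidden low-rank model.
rng = np.random.default_rng(1)
shape, ranks = (8, 7, 6), (2, 2, 2)
true_core = rng.uniform(0.2, 0.8, size=ranks)
true_factors = [rng.uniform(0.2, 0.8, size=(n, r))
                for n, r in zip(shape, ranks)]
entries = []
for _ in range(200):
    idx = tuple(int(rng.integers(n)) for n in shape)
    entries.append((idx, predict(true_core, true_factors, idx)))

core, factors = fit(entries, shape, ranks)
print("RMSE on the sampled entries:", rmse(core, factors, entries))
```

Because each step reads only one core tensor and three factor rows, the per-step cost in this sketch depends on the ranks and not on the tensor dimensions.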
