2015 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2015.301

Multimodal Convolutional Neural Networks for Matching Image and Sentence

Abstract: In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content, and one matching CNN learning the joint representation of image and sentence. The matching CNN composes words to different semantic fragments and le…
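The abstract is truncated above, but it names the two components: an image CNN that encodes image content, and a matching CNN that composes words into semantic fragments and learns a joint image-sentence representation. As a rough illustration of that idea only, here is a minimal PyTorch-style sketch; it is not the authors' implementation, and every module name, layer size, and the ranking loss are assumptions made for the example.

```python
# Minimal sketch of the m-CNN matching idea (illustrative, NOT the authors' code).
# Assumptions: image features come from any pretrained image CNN; vocabulary size,
# embedding width, and hidden sizes are made up for the example.
import torch
import torch.nn as nn

class MatchingCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, img_dim=4096, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # A 1-D convolution composes neighboring words into phrase-level fragments.
        self.compose = nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1)
        self.img_proj = nn.Linear(img_dim, hidden)  # project image CNN features
        # An MLP scores the joint image-sentence representation.
        self.score = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                   nn.ReLU(),
                                   nn.Linear(hidden, 1))

    def forward(self, img_feat, word_ids):
        # img_feat: (B, img_dim) from an image CNN; word_ids: (B, T) token indices.
        words = self.embed(word_ids).transpose(1, 2)   # (B, embed_dim, T)
        frags = torch.relu(self.compose(words))        # (B, hidden, T) fragments
        sent = frags.max(dim=2).values                 # max-pool over positions
        img = torch.relu(self.img_proj(img_feat))      # (B, hidden)
        joint = torch.cat([img, sent], dim=1)          # joint representation
        return self.score(joint).squeeze(1)            # match score per pair

def rank_loss(pos, neg, margin=0.5):
    # Hinge-style ranking loss over matched vs. mismatched pairs, a common
    # training objective for image-sentence matching (an assumption here).
    return torch.clamp(margin + neg - pos, min=0).mean()
```

In practice, such models are trained so that a true image-sentence pair scores higher than randomly mismatched pairs by at least the margin.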

Cited by 304 publications (186 citation statements). References 42 publications.
“…For NN approaches to content-based image retrieval, see Wan et al. (2014). Similarly, we exclude work on NN approaches to acoustic or multi-modal IR, such as mixing text with imagery (see Ma et al. 2015b, 2016). We also intentionally focus on the current ''third wave'' revival of NN research, excluding earlier work.…”
“…We also plan to investigate more complicated and flexible harmonization functions to adapt to the complex content of multimodal data. Further, we will apply the proposed method to more complicated multimodal correlation learning tasks, e.g., image-text matching [67]. We will also study an extension of our method that can deal with missing data, since real-world multimodal data are usually unpaired or incomplete.…”
Section: Results
“…Another similar method is Latent Dirichlet Allocation (LDA) [5,45], which attempts to model the correlation between multi-modal data. In recent years, deep architectures have also been used to learn multimodal representations [43,8]. These models map the image representation and sentence representation into a common multi-modal embedding space.…”
Section: Related Work
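The excerpt above describes the embedding family of methods: project image features and sentence features into one shared space and rank pairs by similarity there. A minimal sketch of that general idea (not any specific model from the cited works; feature dimensions are assumptions):

```python
# Minimal sketch of a common multimodal embedding space (illustrative only):
# two linear projections map precomputed image and sentence features into a
# shared space, where cosine similarity ranks image-sentence pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=4096, sent_dim=300, joint_dim=512):
        super().__init__()
        self.img_fc = nn.Linear(img_dim, joint_dim)
        self.sent_fc = nn.Linear(sent_dim, joint_dim)

    def forward(self, img_feat, sent_feat):
        # L2-normalize so the dot product equals cosine similarity.
        img = F.normalize(self.img_fc(img_feat), dim=1)
        sent = F.normalize(self.sent_fc(sent_feat), dim=1)
        return img @ sent.t()  # (B_img, B_sent) similarity matrix
```

Retrieval in either direction then reduces to sorting one row (sentence search for an image) or one column (image search for a sentence) of the similarity matrix.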