Proceedings of the 26th ACM International Conference on Multimedia 2018
DOI: 10.1145/3240508.3240712

Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval

Abstract: Cross-modal retrieval between visual data and natural language description remains a long-standing challenge in multimedia. While recent image-text retrieval methods offer great promise by learning deep representations aligned across modalities, most of these methods are plagued by the issue of training with small-scale datasets covering a limited number of images with ground-truth sentences. Moreover, it is extremely expensive to create a larger dataset by annotating millions of images with sentences and may …

Cited by 59 publications (16 citation statements)
References 59 publications

Citation statements:
“…Nevertheless, without extra information, improvement is limited. Chowdhury et al. [44] introduced additional web information to cross-modal retrieval.…”
Section: Deep Learning Methods (mentioning; confidence: 99%)
“…Numerous publications in recent years deal with multimodal information in retrieval tasks. The general problem of reducing or bridging the semantic gap [44] between images and text is the main issue in cross-media retrieval [3,34,35,39,50]. Fan et al. [8] tackle this problem by modeling humans' visual and descriptive senses with a multi-sensory fusion network.…”
Section: Multimedia Information Retrieval (mentioning; confidence: 99%)
“…CHAIN-VSE [16] builds a bidirectional retrieval framework which relies on a character-level inception module for visual-semantic embeddings. Mithun et al. [41] propose a two-stage approach for image-text retrieval, using a supervised ranking loss combined with weakly-supervised web images to learn a multimodal representation. However, these methods do not capture the fine-grained correlation between image and text well.…”
Section: Related Work (mentioning; confidence: 99%)
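The two-stage approach cited above combines a supervised ranking loss on ground-truth image-text pairs with weak supervision from web images. In joint-embedding retrieval work, that ranking loss is commonly a bidirectional hinge-based triplet loss over in-batch negatives. The PyTorch sketch below illustrates this general formulation; the function name, margin value, and in-batch negative sampling are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet ranking loss over all in-batch negatives.

    img_emb, txt_emb: (batch, dim) L2-normalized embeddings, where
    row i of each tensor forms a matching image-text pair.
    Margin and loss shape are illustrative, not the paper's exact setup.
    """
    # Cosine similarity matrix; the diagonal holds positive-pair scores.
    scores = img_emb @ txt_emb.t()            # (batch, batch)
    pos = scores.diag().view(-1, 1)           # (batch, 1)

    # Image-to-text: each image should score its own caption higher
    # than any other caption in the batch, by at least `margin`.
    cost_i2t = (margin + scores - pos).clamp(min=0)
    # Text-to-image: the symmetric term, taken over the columns.
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)

    # Zero the diagonal: a positive pair is not its own negative.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)

    return cost_i2t.sum() + cost_t2i.sum()

# Toy usage with random, L2-normalized embeddings.
img = F.normalize(torch.randn(8, 512), dim=1)
txt = F.normalize(torch.randn(8, 512), dim=1)
loss = bidirectional_ranking_loss(img, txt)
```

Summing over all in-batch negatives is one common choice; variants that keep only the hardest negative per query (as in VSE++) often retrieve better in practice.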