2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00660

Composing Text and Image for Image Retrieval - an Empirical Odyssey

Abstract: In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image. For example, we may present an image of the Eiffel tower, and ask the system to find images which are visually similar, but are modified in small ways, such as being taken at nighttime instead of during the day. To tackle this task, we learn a similarity metric between a target image x_j and a source image x_i plus source text t …
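To make the composed-query setup concrete, here is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the paper's model: the concatenate-and-project composition module, the feature dimensions, and the cosine-similarity scoring are all placeholder choices (the paper's contribution is the design of the composition function itself).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedQueryRetrieval(nn.Module):
    """Score a target image against a query composed of a source image and a
    modification text. All module choices here are illustrative stand-ins."""

    def __init__(self, img_dim=512, txt_dim=512, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)  # image features -> joint space
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # text features -> joint space
        # Placeholder composition: concatenate the two projections, then project back.
        self.compose = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, src_img_feat, mod_txt_feat, tgt_img_feat):
        # q represents "source image x_i, modified as described by the text".
        q = self.compose(torch.cat(
            [self.img_proj(src_img_feat), self.txt_proj(mod_txt_feat)], dim=-1))
        t = self.img_proj(tgt_img_feat)
        # Cosine similarity stands in for the learned metric between query and target x_j.
        return F.cosine_similarity(q, t, dim=-1)
```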

Cited by 285 publications (323 citation statements). References 49 publications.
“…There is a long history of research on text and image matching. This research has greatly promoted the development of cross-modal applications such as cross-modal retrieval [40], image captioning [1], and visual question answering [2].…”
Section: Text and Image Matching
confidence: 99%
“…Recently, there has been a surge of research interest in cross-modal retrieval, which takes one type of data as the query and retrieves relevant data of another type. The pivot of cross-modal retrieval is to learn a meaningful cross-modal matching [40].…”
Section: Introduction
confidence: 99%
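One common way to instantiate such a cross-modal matching is a contrastive (InfoNCE-style) objective over paired image and text embeddings. The following PyTorch sketch is a generic illustration, not the cited paper's exact objective; the function name and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_matching_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (B, D) tensors; row i of each is a matching image/text pair.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) pairwise similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Pull matching pairs together and push non-matching batch items apart,
    # in both the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```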
“…Interactive image search aims to incorporate user feedback as an interactive signal to navigate the visual search. In general, the user interaction can be given in various formats, including relative attributes [45,28,75], attributes [79,18,2], attribute-like modification text [66], natural language [16,17], spatial layout [37], and sketches [76,74,14]. As text is the most pervasive mode of interaction between humans and computers in contemporary search engines, it naturally serves to convey concrete information that elaborates the user's intricate specifications for image search.…”
Section: Related Work
confidence: 99%
“…Each image is tagged with a descriptive text as its product description, such as "white logo print t-shirt", which is exploited as side information for auxiliary supervision via joint training. Following [66], we use the training split of around 172k images for training and the test set of 33,480 queries for evaluation. During training, pairwise images with attribute-like modification texts are generated by comparing their product descriptions (see Supplementary Material).…”
Section: Fashion200k
confidence: 99%
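To illustrate how such modification texts might be derived by comparing product descriptions, here is a hypothetical sketch; the actual generation protocol lives in the cited paper's supplementary material, and the function name, input format, and one-word-difference heuristic are all assumptions.

```python
from itertools import combinations

def generate_modification_pairs(descriptions):
    """Pair images whose product descriptions differ in exactly one word and
    phrase the difference as an attribute-like modification text.

    descriptions: dict of image_id -> description string,
                  e.g. {"a": "white logo print t-shirt",
                        "b": "blue logo print t-shirt"}.
    Quadratic in the number of images; purely illustrative.
    """
    pairs = []
    items = [(img_id, desc.split()) for img_id, desc in descriptions.items()]
    for (id_a, words_a), (id_b, words_b) in combinations(items, 2):
        if len(words_a) != len(words_b):
            continue
        diffs = [(a, b) for a, b in zip(words_a, words_b) if a != b]
        if len(diffs) == 1:  # exactly one attribute word differs
            old, new = diffs[0]
            pairs.append((id_a, id_b, f"replace {old} with {new}"))
    return pairs
```

On the two example descriptions in the docstring, this yields ("a", "b", "replace white with blue").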
“…The ability to process information mimicking the visual attention mechanism is known as saliency detection in computer vision. As a preprocessing step, saliency detection promotes efficiency in a wide variety of vision-oriented multimedia applications, such as semantic segmentation [1]- [5], image quality assessment [6], [7], image and video compression [8], [9], image retargeting [10], [11], image classification [12]- [14], and image retrieval [15], [16]. Saliency detection can be divided into eye-fixation prediction [17]- [20], which predicts the focus of the human gaze, and salient object detection [21]- [25], which extracts the most salient objects or regions from a scene.…”
Section: Introduction
confidence: 99%