2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2016.9

Generation and Comprehension of Unambiguous Object Descriptions

Abstract: We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while …
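As a rough illustration of the comprehension task described in the abstract, the sketch below picks the candidate region whose features make the query expression most probable under a captioning-style LSTM. This is a minimal sketch rather than the authors' released code: the class names, feature dimensions, vocabulary size, and toy inputs are all placeholder assumptions.

```python
# Minimal sketch (not the authors' released code): comprehension as picking the
# candidate region whose features make the query expression most probable under
# a captioning-style LSTM. Region features, vocabulary and sizes are made up.
import torch
import torch.nn as nn

class RegionExpressionScorer(nn.Module):
    def __init__(self, vocab_size, region_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.region_proj = nn.Linear(region_dim, hidden_dim)  # region feature -> initial hidden state
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def log_prob(self, region_feat, tokens):
        """Log p(expression | region): sum of per-token log-probabilities."""
        h0 = torch.tanh(self.region_proj(region_feat)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        emb = self.embed(tokens[:, :-1])                              # condition on previous tokens
        hidden, _ = self.lstm(emb, (h0, c0))
        logits = self.out(hidden)                                     # (B, T-1, V)
        logp = torch.log_softmax(logits, dim=-1)
        target = tokens[:, 1:].unsqueeze(-1)
        return logp.gather(-1, target).squeeze(-1).sum(dim=1)         # (B,)

def comprehend(scorer, region_feats, tokens):
    """Return the index of the candidate region that best explains the expression."""
    batch = region_feats.size(0)
    scores = scorer.log_prob(region_feats, tokens.expand(batch, -1))
    return scores.argmax().item()

# Toy usage with random features and a fake tokenized expression.
scorer = RegionExpressionScorer(vocab_size=1000)
regions = torch.randn(5, 2048)          # 5 candidate object regions
expr = torch.randint(1, 1000, (1, 7))   # e.g. "<bos> the man in the red hat" as ids
print(comprehend(scorer, regions, expr))
```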

Cited by 960 publications (1,022 citation statements) | References 44 publications
“…We are inspired by recent progress in object retrieval [15,14,26]. Both Hu et al. [15,14] and Mao et al. [26] present a recurrent neural network able to localize an object in an image by means of a natural language query only, either returning a bounding box [15] or a free-form segment [14].…”
Section: Introduction (mentioning)
confidence: 99%
“…Both Hu et al. [15,14] and Mao et al. [26] present a recurrent neural network able to localize an object in an image by means of a natural language query only, either returning a bounding box [15] or a free-form segment [14]. To cope with language ambiguity, Mao et al. introduce referring expressions that uniquely describe an object in an image.…”
Section: Introduction (mentioning)
confidence: 99%
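One hedged way to make the ambiguity point above concrete: score the expression against every candidate region in the image and apply a softmax over regions, so the described object has to out-score the distractors. The snippet below is an assumption about how such an objective could look, not necessarily the paper's exact loss; the per-region log-probabilities are toy values.

```python
# Hedged sketch of one way to train for unambiguous descriptions: softmax the
# expression's log-probability over all candidate regions in the image, so the
# true region must out-score the distractors. The exact objective used in the
# paper may differ; the scores below are toy values.
import torch
import torch.nn.functional as F

def discriminative_loss(logp_per_region, true_idx):
    """logp_per_region: (R,) log p(expression | region_r); true_idx: index of the described object."""
    # Cross-entropy over regions pushes p(expr | true region) above p(expr | distractors).
    return F.cross_entropy(logp_per_region.unsqueeze(0), torch.tensor([true_idx]))

logp = torch.tensor([-12.3, -9.1, -15.0, -11.7])  # toy per-region scores
print(discriminative_loss(logp, true_idx=1))
```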
“…Sentence semantics only provides ambiguous and implicit labels. This resembles another line of work that learns structured output from image captions (Berg et al. 2004; Gupta and Davis 2008; Luo et al. 2009; Jamieson et al. 2010a, b; Plummer et al. 2015; Mao et al. 2016), treating the input as a parallel image-text dataset. However, all of these methods, except Gupta and Davis (2008) and Jamieson et al. (2010a, b), use pretrained object models learned from other datasets.…”
Section: Related Work (mentioning)
confidence: 99%
“…These features are then associated in a learning process with certain words, resulting in an association of colour features with colour words, spatial features with prepositions, etc., and based on this, these words can be interpreted with reference to the scene currently presented to the video feed. Whereas Roy's work still looked at relatively simple scenes with graphical objects, research on REG has recently started to investigate set-ups based on real-world images (Kazemzadeh et al., 2014; Gkatzia et al., 2015; Zarrieß and Schlangen, 2016; Mao et al., 2015). Importantly, the low-level visual features that can be extracted from these scenes correspond less directly to particular word classes.…”
Section: Related Work (mentioning)
confidence: 99%
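The feature-word association described in this excerpt can be sketched as a word-as-classifier setup: one binary classifier per word over low-level region features (colour histograms, spatial coordinates, and so on), trained on regions whose descriptions contain that word. The feature extraction, vocabulary, and data below are placeholders, not taken from any of the cited systems.

```python
# Minimal sketch of the word-feature association described above: each word gets
# its own binary classifier over low-level visual features, trained on regions
# whose descriptions contain the word. Features and data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_word_classifiers(region_feats, region_words, vocabulary):
    """region_feats: (N, D) array; region_words: list of word sets, one per region."""
    classifiers = {}
    for word in vocabulary:
        labels = np.array([1 if word in words else 0 for words in region_words])
        if 0 < labels.sum() < len(labels):  # need both positive and negative regions
            classifiers[word] = LogisticRegression(max_iter=200).fit(region_feats, labels)
    return classifiers

# Toy data: 6 regions with 4-dimensional "colour + position" features.
feats = np.random.rand(6, 4)
words = [{"red", "ball"}, {"red"}, {"blue", "box"}, {"blue"}, {"red", "box"}, {"blue", "ball"}]
clfs = train_word_classifiers(feats, words, vocabulary={"red", "blue", "ball", "box"})
print(clfs["red"].predict_proba(feats[:1]))
```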