Generation and Comprehension of Unambiguous Object Descriptions

Mao, Junhua; Huang, Jonathan; Toshev, Alexander; Camburu, Oana; Yuille, Alan; Murphy, Kevin

doi:10.1109/cvpr.2016.9

Cited by 960 publications

(1,022 citation statements)

References 44 publications

Supporting

Mentioning

1,017

Contrasting

Unclassified

“…We are inspired by recent progress in object retrieval [15,14,26]. Both Hu et al [15,14] and Mao et al [26] present a recurrent neural network able to localize an object in an image by means of a natural language query only, either returning a bounding box [15] or a free-form segment [14].…”

Section: Introduction (mentioning)

confidence: 99%

“…Both Hu et al [15,14] and Mao et al [26] present a recurrent neural network able to localize an object in an image by means of a natural language query only, either returning a bounding box [15] or a free-form segment [14]. To cope with language ambiguity, Mao et al introduce referring expressions that uniquely describe an object in an image.…”

Section: Introduction (mentioning)

confidence: 99%

“…Unlike [15,14,26] we do not retrieve but track the object of interest in video from a natural language specification.…”

Section: Introduction (mentioning)

confidence: 99%

Tracking by Natural Language Specification

Tao, Gavves, et al. 2017

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

“…Sentence semantics only provides ambiguous and implicit labels. This resembles another line of work that learns structured output from image captions (Berg et al 2004; Gupta and Davis 2008; Luo et al 2009; Jamieson et al 2010a, b; Plummer et al 2015; Mao et al 2016), treating the input as a parallel image-text dataset. However, all of these methods, except Gupta and Davis (2008) and Jamieson et al (2010a, b) use pretrained object models learned from other datasets.…”

Section: Related Work (mentioning)

confidence: 99%

Sentence Directed Video Object Codiscovery

Yu, Siskind, 2017

Int J Comput Vis

Video object codiscovery can leverage the weak semantic constraint implied by sentences that describe the video content. Our codiscovery method, like other object codetection techniques, does not employ any pretrained object models or detectors. Unlike most prior work that focuses on codetecting large objects which are usually salient both in size and appearance, our method can discover small or medium sized objects as well as ones that may be occluded for part of the video. More importantly, our method can codiscover multiple object instances of different classes within a single video clip. Although the semantic information employed is usually simple and weak, it can greatly boost performance by constraining the hypothesized object locations. Experiments show promising results on three datasets: an average IoU score of 0.423 on a new dataset with 15 object

“…These features are then associated in a learning process with certain words, resulting in an association of colour features with colour words, spatial features with prepositions, etc., and based on this, these words can be interpreted with reference to the scene currently presented to the video feed. Whereas Roy's work still looked at relatively simple scenes with graphical objects, research on REG has recently started to investigate set-ups based on real-world images (Kazemzadeh et al, 2014; Gkatzia et al, 2015; Zarrieß and Schlangen, 2016; Mao et al, 2015). Importantly, the low-level visual features that can be extracted from these scenes correspond less directly to particular word classes.…”

Section: Related Work (mentioning)

confidence: 99%

Obtaining referential word meanings from visual and distributional information: Experiments on object naming

Zarrieß, Schlangen, 2017

Proceedings of the 55th Annual Meeting of the Association For Computational Linguistics (Volume 1: Long Papers)

We investigate object naming, which is an important sub-task of referring expression generation on real-world images. As opposed to mutually exclusive labels used in object recognition, object names are more flexible, subject to communicative preferences and semantically related to each other. Therefore, we investigate models of referential word meaning that link visual to lexical information which we assume to be given through distributional word embeddings. We present a model that learns individual predictors for object names that link visual and distributional aspects of word meaning during training. We show that this is particularly beneficial for zero-shot learning, as compared to projecting visual objects directly into the distributional space. In a standard object naming task, we find that different ways of combining lexical and visual information achieve very similar performance, though experiments on model combination suggest that they capture complementary aspects of referential meaning.
