2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.120

Context-Aware Captions from Context-Agnostic Supervision

Abstract: We introduce an inference technique to produce discriminative context-aware image captions (captions that describe differences between images or visual concepts) using only generic context-agnostic training data (captions that describe a concept or an image in isolation). For example, given images and captions of "siamese cat" and "tiger cat", we generate language that describes the "siamese cat" in a way that distinguishes it from "tiger cat". Our key novelty is that we show how to do joint inference over a l…
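
The joint-inference idea in the abstract can be illustrated with a contrastive decoding rule that scores each next word by how much more likely it is under the target image than under the distractor. The sketch below is a minimal illustration under assumptions: lp_target and lp_distractor are hypothetical callables standing in for a pretrained context-agnostic captioner's per-token log-probabilities, the toy vocabulary and distributions are invented, and the greedy loop simplifies away the beam search an actual system would use.

# Minimal sketch (not the paper's code) of contrastive decoding on top of a
# context-agnostic captioner: each word is chosen to maximize
# log p(w | target) - lam * log p(w | distractor).
import math

def greedy_contrastive_decode(lp_target, lp_distractor, vocab,
                              lam=0.5, max_len=10, eos="</s>"):
    caption = []
    for _ in range(max_len):
        t = lp_target(caption)      # word -> log prob under the target image
        d = lp_distractor(caption)  # word -> log prob under the distractor
        word = max(vocab, key=lambda w: t[w] - lam * d[w])
        if word == eos:
            break
        caption.append(word)
    return caption

# Toy stand-ins for the captioner: "striped" is likely for both concepts,
# "cream-colored" only for the target, so the contrastive rule prefers the
# distinguishing word even though "striped" has the highest raw probability.
def _log(probs):
    return {w: math.log(p) for w, p in probs.items()}

def lp_target(prefix):
    if not prefix:
        return _log({"striped": 0.40, "cream-colored": 0.35, "cat": 0.20, "</s>": 0.05})
    if len(prefix) == 1:
        return _log({"striped": 0.05, "cream-colored": 0.05, "cat": 0.60, "</s>": 0.30})
    return _log({"striped": 0.05, "cream-colored": 0.05, "cat": 0.10, "</s>": 0.80})

def lp_distractor(prefix):
    if not prefix:
        return _log({"striped": 0.55, "cream-colored": 0.05, "cat": 0.30, "</s>": 0.10})
    return lp_target(prefix)

vocab = ["striped", "cream-colored", "cat", "</s>"]
print(greedy_contrastive_decode(lp_target, lp_distractor, vocab))
# ['cream-colored', 'cat']

With lam = 0 the rule reduces to ordinary context-agnostic decoding; raising lam trades fluency for discriminativeness.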

Cited by 116 publications (144 citation statements). References 36 publications.
“…distinguishing one image from another. Similarly, image descriptions are generated to make the target image distinguishable from a similar image in [28], and referential expressions are generated on objects in a discriminative way such that one can correctly localize the mentioned object from the generated expression in [17]. In this work, we generate textual explanation to maximize both class-specificity and image-relevance.…”
Section: Related Work (mentioning)
Confidence: 99%
“…A few prior works explore caption sampling and re-scoring during inference [2,18,56]. Specifically, [18] aim to obtain more image-grounded bird explanations, while [2,56] aim to generate discriminative captions for a given distractor image. While our approach is similar, our goal is different, as we work with video rather than images, and aim to improve multi-sentence description with respect to multiple properties.…”
Section: Related Work (mentioning)
Confidence: 99%
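
The sample-and-re-score pattern credited to [2,18,56] in this excerpt amounts to drawing several candidate captions from a context-agnostic captioner and keeping the one a separate discriminative scorer ranks highest. The sketch below is a toy illustration: sample_caption and score are invented stand-ins, not any cited paper's interface.

import random

def sample_and_rescore(sample_caption, score, n_samples=5, seed=0):
    rng = random.Random(seed)
    candidates = [sample_caption(rng) for _ in range(n_samples)]
    return max(candidates, key=score)

# Toy stand-ins: the "captioner" samples generic or distinguishing phrasings;
# the scorer rewards captions that mention the distinguishing attribute.
def sample_caption(rng):
    return rng.choice(["a striped cat", "a cream-colored cat", "a cat"])

def score(caption):
    return 1.0 if "cream-colored" in caption else 0.0

print(sample_and_rescore(sample_caption, score))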
“…In recent years, a variety of successive models [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][18][19][20] have achieved promising results. To generate captions, semantic concepts or attributes of objects in images are detected and utilized as inputs of the RNN decoder [3,6,12,20,22].…”
Section: Deep Image Captioning (mentioning)
Confidence: 99%
“…The encoder-decoder model first extracts high-level visual features from a CNN trained on the image classification task, and then feeds the visual features into an RNN model to predict subsequent words of a caption for a given image. In recent years, a variety of successive models [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][18][19][20] have achieved promising results. Semantic concept analysis, or attribute prediction [17,21], is a task closely related to image captioning, because attributes can be interpreted as a basis for descriptions.…”
Section: Deep Image Captioning (mentioning)
Confidence: 99%
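
The encoder-decoder pipeline described in this excerpt can be summarized in a few lines. The sketch below assumes PyTorch, replaces the pretrained CNN with pre-extracted feature vectors, and uses invented layer sizes, so it shows the data flow rather than any specific cited model.

import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=512, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)    # projects CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # next-word logits

    def forward(self, cnn_features, caption_tokens):
        # Feed the projected image feature as the first step, then the word
        # embeddings; each LSTM state predicts the following caption token.
        img = self.proj(cnn_features).unsqueeze(1)        # (B, 1, E)
        words = self.embed(caption_tokens)                # (B, T, E)
        hidden, _ = self.rnn(torch.cat([img, words], 1))  # (B, T+1, H)
        return self.out(hidden)                           # (B, T+1, V)

# Dummy forward pass: 2 pre-extracted 512-d image features, 5-token prefixes.
model = TinyCaptioner()
logits = model(torch.randn(2, 512), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 1000])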