Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Krishna, Ranjay; Zhu, Yuke; Groth, Oliver; Johnson, Justin; Hata, Kenji; Kravitz, Joshua; Chen, Stephanie; Kalantidis, Yannis; Li, Lijia; Shamma, David A.; Bernstein, Michael S.; Li, Feifei

doi:10.1007/s11263-016-0981-7

Cited by 4,385 publications

(3,285 citation statements)

References 100 publications

Mentioning: 3,271


“…A number of recent works have proposed visual question answering datasets [3,22,26,31,10,46,38,36] and models [9,25,2,43,24,27,47,45,44,41,35,20,29,15,42,33,17]. Our work builds on top of the VQA dataset from Antol et al. [3], which is one of the most widely used VQA datasets.…”

Section: Related Work (mentioning)

confidence: 99%

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Goyal, Khot, Summers-Stay, et al. 2017

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)


Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability. We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset [3] by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at http://visualqa.org/ as part of the 2nd iteration of the Visual Question Answering Dataset and Challenge (VQA v2.0). We further benchmark a number of state-of-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners. Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counterexample-based explanation. Specifically, it identifies an image that is similar to the original image, but it believes has a different answer to the same question. This can help in building trust for machines among their users. (The first two authors contributed equally.)


“…None of these works objectively assess the quality of scene graph hypotheses compared with ground truth graphs. However, reasonable measures for this problem are important especially after the publication of the Visual Genome dataset [14].…”

Section: Related Work (mentioning)

confidence: 99%

On support relations and semantic scene graphs

Yang, Liao, Ackermann, et al. 2017

ISPRS Journal of Photogrammetry and Remote Sensing


Rapid development of robots and autonomous vehicles requires semantic information about the surrounding scene to decide upon the correct action or to be able to complete particular tasks. Scene understanding provides the necessary semantic interpretation by semantic scene graphs. For this task, so-called support relationships, which describe the contextual relations between parts of the scene such as floor, wall, table, etc., need to be known. This paper presents a novel approach to infer such relations and then to construct the scene graph. Support relations are estimated by considering important, previously ignored information: the physical stability and the prior support knowledge between object classes. In contrast to previous methods for extracting support relations, the proposed approach generates more accurate results, and does not require a pixel-wise semantic labeling of the scene. The semantic scene graph, which describes all the contextual relations within the scene, is constructed using this information. To evaluate the accuracy of these graphs, multiple different measures are formulated. The proposed algorithms are evaluated using the NYUv2 database. The results demonstrate that the inferred support relations are more precise than the state of the art. The scene graphs are compared against ground truth graphs.


“…A simple form of description can be generated from such region labels, but it would not be much more than a list. A step further is recent work on visual relationship detection [16], [17], [18] where relations between objects are identified in addition to the objects themselves.…”

Section: Related Research In Image Description (mentioning)

confidence: 99%

Learning to Generate Descriptions of Visual Data Anchored in Spatial Relations

Muscat, Belz 2017

IEEE Comput. Intell. Mag.


The explosive growth of visual data both online and offline in private and public repositories has led to urgent requirements for better ways to index, search, retrieve, process and manage visual content. Automatic methods for generating image descriptions can help with all these tasks as well as playing an important role in assistive technology for the visually impaired. The task we address in this paper is the automatic generation of image descriptions that are anchored in spatial relations. We construe this as a three-step task where the first step is to identify objects in an image; the second step detects spatial relations between object pairs on the basis of language and visual features; and in the third step, the spatial relations are mapped to natural language (NL) descriptions. We describe the data we have created, and compare a range of machine learning methods in terms of the success with which they learn the mapping from features to spatial relations, using automatic and human-assessed evaluations. We find that a random forest model performs best by a substantial margin. We examine aspects of our approach in more detail, including data annotation and choice of features. For Step 3, we describe six alternative natural language generation (NLG) strategies and evaluate the resulting NL strings using measures of correctness, naturalness and completeness. Finally, we discuss evaluation issues, including the importance of extrinsic context in data creation and evaluation design.
