2017
DOI: 10.1007/s11263-016-0981-7

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Abstract: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image…
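As context for the abstract's emphasis on interactions and relationships between objects, Visual Genome annotates each image with a scene graph: localized objects linked by subject-predicate-object relationship triples. A minimal sketch of that idea follows; the class names and fields are illustrative assumptions for this note, not the dataset's actual schema or API.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Obj:
    name: str                         # object category, e.g. "man"
    bbox: Tuple[int, int, int, int]   # (x, y, w, h) region in pixels

@dataclass
class Relationship:
    subject: Obj
    predicate: str                    # e.g. "riding"
    obj: Obj

@dataclass
class SceneGraph:
    image_id: int
    objects: List[Obj] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

# Encode "man riding horse" as one relationship triple for one image.
man = Obj("man", (50, 30, 120, 200))
horse = Obj("horse", (40, 120, 220, 180))
graph = SceneGraph(image_id=1, objects=[man, horse],
                   relationships=[Relationship(man, "riding", horse)])
print([(r.subject.name, r.predicate, r.obj.name) for r in graph.relationships])
# [('man', 'riding', 'horse')]

Running the sketch prints the triple ('man', 'riding', 'horse'), the kind of structured relationship the dataset uses to ground language in localized image regions.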

Cited by 4,385 publications (3,285 citation statements)
References 100 publications

“…A number of recent works have proposed visual question answering datasets [3,22,26,31,10,46,38,36] and models [9,25,2,43,24,27,47,45,44,41,35,20,29,15,42,33,17]. Our work builds on top of the VQA dataset from Antol et al. [3], which is one of the most widely used VQA datasets.…”
Section: Related Work (mentioning)
confidence: 99%
“…None of these works objectively assess the quality of scene graph hypotheses compared with ground truth graphs. However, reasonable measures for this problem are important especially after the publication of the Visual Genome dataset [14].…”
Section: Related Work (mentioning)
confidence: 99%
“…A simple form of description can be generated from such region labels, but it would not be much more than a list. A step further is recent work on visual relationship detection [16], [17], [18] where relations between objects are identified in addition to the objects themselves.…”
Section: Related Research In Image Description (mentioning)
confidence: 99%