Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1644

A Corpus for Reasoning about Natural Language Grounded in Photographs

Abstract: We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the…
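To make the task format concrete, below is a minimal Python sketch of one example as the abstract describes it: a sentence paired with two photographs and a binary truth-value label, plus the accuracy metric used for such binary classification tasks. The field names and the Example class are illustrative assumptions, not the dataset's official schema.

```python
# Hypothetical sketch of the task structure (field names are assumptions,
# not the dataset's released schema): one English sentence, two photographs,
# and a label saying whether the sentence is true of the image pair.
from dataclasses import dataclass


@dataclass
class Example:
    sentence: str      # natural language caption
    left_image: str    # path or URL of the first photograph
    right_image: str   # path or URL of the second photograph
    label: bool        # True if the caption is true of the image pair


def accuracy(predictions: list[bool], examples: list[Example]) -> float:
    """Fraction of examples whose predicted truth value matches the label."""
    correct = sum(pred == ex.label for pred, ex in zip(predictions, examples))
    return correct / len(examples)
```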

Cited by 333 publications (273 citation statements)
References 42 publications
“…We test 3 models that have proved effective in visual reasoning tasks (Johnson et al., 2017a; Suhr et al., 2018; Yi et al., 2018). All models are multi-modal, i.e., they use both a visual representation of the scene and a linguistic representation of the sentence.…”
Section: Models
confidence: 99%
“…FOIL takes a different approach and requires a system to differentiate invalid image descriptions from valid ones (Shekhar et al., 2017). Natural Language Visual Reasoning (NLVR) requires verifying if image descriptions are true (Suhr et al., 2017, 2018).…”
Section: Tasks in V&L Research
confidence: 99%
“…These in turn are exploited by VQA models, which become heavily reliant upon such statistical biases and tendencies within the answer distribution to largely circumvent the need for true visual scene understanding [2,11,15,8]. This situation is exacerbated by the simplicity of many of the questions, from both linguistic and semantic perspectives, which in practice rarely require much beyond object recognition [33]. Consequently, early benchmarks led to an inflated sense of the state of scene understanding, severely diminishing their credibility [37].…”
Section: Introduction
confidence: 99%