2022
DOI: 10.48550/arxiv.2207.12576
Preprint

WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

Abstract: While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations (e.g., werewolves to a full moon), used as a dynamic benchmark to evaluate state-of-the-art models. Inspired by the popular card game Codenames, a "spymaster" gives a textual cue related to several visual candidates, and another player has to identify the…
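To make the association task concrete, here is a minimal sketch (not code from the paper) of how a contrastive vision-and-language model such as CLIP can be scored on one instance: the model embeds the textual cue and each visual candidate, and the resulting similarity logits rank the candidates. The checkpoint name and the candidate image files below are illustrative assumptions.

# A minimal sketch, assuming CLIP via Hugging Face transformers; the
# checkpoint and the candidate image files are hypothetical, not from the paper.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

cue = "werewolf"  # the spymaster's textual cue
candidate_paths = ["full_moon.jpg", "cat.jpg", "bicycle.jpg"]  # hypothetical files
images = [Image.open(p) for p in candidate_paths]

inputs = processor(text=[cue], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_text[0]  # one score per candidate image

# Rank candidates: a higher score means a stronger cue-image association.
for path, score in sorted(zip(candidate_paths, logits.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.2f}")

Roughly speaking, the benchmark then compares the candidates a model selects against the associations that human players produced for the same cue.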

Cited by 4 publications (5 citation statements); references 22 publications.
“…An annotated list of commonsense benchmarks has been assembled on a web site.¹³ Benchmarks are divided into text-based, image-based, video-based, and simulated environments. (The site also includes a collection of symbolic commonsense knowledge-bases, which are primarily resources rather than benchmarks.)…”
Section: The Existing Benchmarks For Automated Commonsense Reasoning
confidence: 99%
“…WinoX [39] French, German, Russian [83] answering 150,000 questions

Name | Task | Size | Construction
COFAR [47] | Find an image matching a query | 25,300 images; 40,800 queries | Expert construction
CoSim [74] | Counterfactual reasoning about images | 3500 instances | Crowd sourcing
CRIC [44] | Compositional reasoning | 96,000 images; 494,000 questions | Synthesized
e-SNLI-VE [71] | Visual-textual entailment | 430,000 | Synthesized from SNLI-VE
FVQA [138] | Visual question answering | 2190 images | Synthesized
GD-VCR [149] | Visual question answering | 328 images; 886 Q/A pairs | Expert construction
Half&Half [123] | Reasoning with text and incomplete images | 126,000 examples | Synthesized
HumanCog [151] | Who in image is being described? | 67,000 images; 138,000 descriptions | Extracted from VCR + crowd sourcing
HVQR [21] | Visual question answering | 33,000 images; 157,000 Q/A pairs | Synthesized
IconQA [94] | Visual question answering | 107,400 instances | Crowd sourcing
KB-VQA [137] | Visual question answering | 2190 images | Synthesized
Naive action-effect prediction [45] | Match image to effect of action | 1400 text effects; 4163 images | Crowd sourcing
PTR [61] | Visual question answering | 80,000 images; 800,000 questions | Synthesized (both images and Q/A pairs)
Sherlock [60] | Inferences from images | 103,000 images; 363,000 inferences | Crowd sourcing
VCR [155] | Visual question answering | 290,000 questions | Crowd sourcing
Visual Genome [78] | Visual question answering | 108,000 images | Crowd sourcing
WinoGAViL [13] | Match image to text | 4482 examples | Gamification

Table 8: Image benchmarks

Name | Task | Size | Construction
AGENT [121] | Is this surprising? | 8400 videos | Synthesized
AGQA [54] | Spatio-temporal | 9600 videos…”
Section: Original
confidence: 99%
“…Other recent works include tasks that evaluate compositionality, visual understanding (Zellers et al. 2019), association (Bitton et al. 2022), analogy (Vedantam et al. 2015), neural reasoning (Forbes, Holtzman, and Choi 2019), and visual common sense (Bitton-Guetta et al. 2023).…”
Section: Related Work
confidence: 99%
“…These VLP models also achieve far worse performance on the new benchmark than on the standard VQAv2 dataset. More recently, Bitton et al. (2022) introduce WinoGAViL, an online game to collect VL associations, used as a dynamic benchmark to evaluate state-of-the-art VLP models. On the one hand, these benchmarks are valuable as they successfully demonstrate the weaknesses of the SoTA VLP models, and shed new light on robustness studies in the community.…”
Section: Robustness and Probing Analysis
confidence: 99%