2022
DOI: 10.48550/arxiv.2204.03162
Preprint

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Abstract: We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly, but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a divers…
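
To make the matching criterion concrete, the sketch below (illustrative only, not the authors' released code) computes the three scores commonly reported for Winoground, the text, image, and group scores, from a model's pairwise image-caption similarities. The function name and indexing convention are assumptions.

def winoground_scores(sim):
    """sim[i][j]: model similarity between image i and caption j for one
    Winoground example, where the correct pairings are (0, 0) and (1, 1)."""
    # Text score: each image must prefer its own caption.
    text_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Image score: each caption must prefer its own image.
    image_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Group score: both conditions must hold at once.
    return text_ok, image_ok, text_ok and image_ok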

Cited by 7 publications (13 citation statements)
References 62 publications
“…In contrast to their work, we investigate several VL models and evaluate their performance on language generation tasks. Bugliarello et al. (2021), Hessel and Lee (2020), Thrush et al. (2022) and Frank et al. (2021) also perform extensive evaluations of several VL models such as LXMERT and VisualBERT. In contrast to our work, they primarily focus on the VL performance of the models, and do not consider the model performance on text-only input.…”
Section: Related Work
confidence: 99%
“…Image-text models that have been contrastively trained on internet-scale data, such as CLIP (Radford et al. 2021a), have been shown to have strong zero-shot classification capabilities. However, recent works (Thrush et al. 2022; Diwan et al. 2022) have highlighted their limitations in visio-linguistic reasoning, as shown in the challenging Winoground benchmark. Yuksekgonul et al. (2023) also observe this issue and introduce a new benchmark, ARO, for image-text models which requires a significant amount of visio-linguistic reasoning to solve.…”
Section: Related Work
confidence: 99%
“…Winoground (Thrush et al. 2022; Diwan et al. 2022) is a challenging vision-language dataset for evaluating the visio-linguistic characteristics of contrastively trained image-text models. The dataset consists of 400 tasks, where each task consists of two image-text pairs.…”
Section: Benchmark Datasets
confidence: 99%
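
As a concrete illustration of how one such task can be scored with a contrastively trained image-text model, the sketch below uses a public Hugging Face CLIP checkpoint to compute the 2x2 image-caption similarity matrix for a single example and applies the pairing check. The image paths are placeholders and the captions are only adapted from an example in the paper; the real data comes from the Winoground benchmark itself.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder inputs: actual images and captions come from the benchmark.
images = [Image.open("example_image_0.png"), Image.open("example_image_1.png")]
captions = ["some plants surrounding a lightbulb",
            "a lightbulb surrounding some plants"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    # sim has shape (2, 2); sim[i, j] is the score of image i with caption j.
    sim = model(**inputs).logits_per_image

text_ok = (sim[0, 0] > sim[0, 1]).item() and (sim[1, 1] > sim[1, 0]).item()
image_ok = (sim[0, 0] > sim[1, 0]).item() and (sim[1, 1] > sim[0, 1]).item()
group_ok = text_ok and image_ok
print(f"text: {text_ok}, image: {image_ok}, group: {group_ok}")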