2022
DOI: 10.48550/arxiv.2211.00768
Preprint

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

Cited by 2 publications (6 citation statements)
References: 0 publications

“…ARO (Yuksekgonul et al 2023) similarly tests visiolinguistic reasoning and consists of three types of tasks: (i) Visual Genome Attribution to test the understanding of object properties; (ii) Visual Genome [Relation] to test for relational understanding between objects; and (iii) COCO-Order and Flickr30k-Order to test for order sensitivity of the words in a text, when performing image-text matching. We highlight that Winoground, though slightly smaller in size than ARO, is more challenging as it requires reasoning beyond visio-linguistic compositional knowledge (Diwan et al 2022).…”
Section: Benchmark Datasets | mentioning | confidence: 99%
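
The order-sensitivity tasks quoted above (COCO-Order and Flickr30k-Order) check whether an image-text model prefers the correct caption over versions of it with the words reordered. A minimal sketch of that check follows, under simplifying assumptions: random word shuffles stand in for ARO's actual perturbation scheme, and a hypothetical bag-of-words scorer (`bow_score`) stands in for a real image-text model such as CLIP. All function names here are illustrative, not part of ARO or any library.

```python
import random
from collections.abc import Callable


def bow_score(image_tags: set[str], caption: str) -> int:
    # Hypothetical stand-in for an image-text model: counts caption words
    # that match the image's tags and ignores word order entirely.
    return sum(1 for w in caption.lower().split() if w in image_tags)


def order_perturbations(caption: str, n: int, seed: int = 0) -> list[str]:
    # Up to n distinct random word shuffles of the caption, excluding the
    # original order (a simplified stand-in for ARO's perturbations).
    words = caption.split()
    rng = random.Random(seed)
    perturbed: set[str] = set()
    for _ in range(1000):  # bounded attempts in case few orderings exist
        if len(perturbed) >= n:
            break
        shuffled = words[:]
        rng.shuffle(shuffled)
        candidate = " ".join(shuffled)
        if candidate != caption:
            perturbed.add(candidate)
    return sorted(perturbed)


def order_test_passes(
    score: Callable[[set[str], str], float],
    image: set[str],
    caption: str,
    negatives: list[str],
) -> bool:
    # Passes only if the true caption scores strictly higher than every
    # reordered caption for the same image.
    if not negatives:
        raise ValueError("need at least one perturbed caption")
    true_score = score(image, caption)
    return all(true_score > score(image, neg) for neg in negatives)


if __name__ == "__main__":
    tags = {"horse", "eating", "grass"}
    caption = "the horse is eating the grass"
    negatives = order_perturbations(caption, n=5)
    print(order_test_passes(bow_score, tags, caption, negatives))  # False
```

Because every shuffle contains exactly the same words, the bag-of-words scorer gives each one the same score as the true caption, so `order_test_passes` returns `False`: a scorer that ignores word order cannot pass this kind of test.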

“…Image-text models that have been contrastively trained on internet-scale data, such as CLIP (Radford et al 2021a), have been shown to have strong zero-shot classification capabilities. However, recent works (Thrush et al 2022; Diwan et al 2022) have highlighted their limitations in visio-linguistic reasoning, as shown in the challenging Winoground benchmark. Yuksekgonul et al (2023) also observe this issue and introduce a new benchmark, ARO, for image-text models, which requires a significant amount of visio-linguistic reasoning to solve.…”
Section: Related Work | mentioning | confidence: 99%