Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.567
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

Abstract: We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the co…

Cited by 39 publications (25 citation statements)
References 24 publications
“…Norlund et al (2021) investigated the effect of multimodal training on textual representations, and concluded that the degree of transfer between the representations of the respective modalities is limited, at least for CNN-based models; Hagström and Johansson (2022a,b) drew similar conclusions based on more extensive experiments that also include the FLAVA model. Parcalabescu et al (2021) considered the task of predicting numbers and arrived at a conclusion similar to ours: frequently occurring numbers are predicted more often by the model.…”
Section: Related Work
confidence: 60%
“…The previous work that is most closely related to ours in terms of research questions and methodology is that by Frank et al (2021). They designed ablation tests where parts of the image or the text are hidden; as we have discussed, this setup is comparable to our experiments where black and white-noise images are used.…”
Section: Related Work
confidence: 91%
“…To evaluate extensions and adaptations of VQA models aimed at people with visual impairments, this work will study and develop a vision-and-language-oriented checklist based on VALSE (Parcalabescu et al, 2021). This is a novel benchmark designed to test visual-linguistic capabilities of pretrained general-purpose language and vision models.…”
Section: Design Of A Checklist Oriented To Vision And Language
confidence: 99%