2021
DOI: 10.48550/arxiv.2112.07566
Preprint

VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

Abstract: We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the co…

Cited by 6 publications (14 citation statements)
References 33 publications
“…The second strand of work in curating adversarial samples comprises efforts revolving around red-teaming (Ganguli et al., 2022; Perez et al., 2022) that aim to explicitly elicit certain sets of behaviours from foundation models; primarily, these approaches look at the problem of adversarial benchmarking from a safety perspective. Further, a host of benchmarks that aim to stress-test models are appearing on the horizon; their primary goal is to create test sets for manually discovered failure modes (Yuksekgonul et al., 2022; Parcalabescu et al., 2021; Thrush et al., 2022; Udandarao et al., 2023; Hsieh et al., 2023; Kamath et al., 2023; Bitton-Guetta et al., 2023; Bordes et al., 2023). However, while they are sample efficient, they are criticized as unfair.…”
Section: Extended Related Work
confidence: 99%
“…The Visual Commonsense Tests (ViComTe) dataset (Zhang et al., 2022a) is created to test to what degree unimodal (language-only) and multimodal (image and language) models capture a broad range of visually salient attributes. VALSE (Parcalabescu et al., 2021) is proposed to test VLP models centered on linguistic phenomena. CARET (Jimenez et al., 2022) is proposed to systematically measure the consistency and robustness of modern VQA models through six fine-grained capability tests.…”
Section: Robustness and Probing Analysis
confidence: 99%
“…Visual Language Inference With the advent of visual language models (VLMs; Liu et al., 2021; Li et al., 2019; Cho et al., 2021; Huang et al., 2022) that can simultaneously process visual and linguistic information, there is growing interest in enriching text-only tasks with visual context (Parcalabescu et al., 2021; Xie et al., 2018; Vu et al., 2018). Vu et al. (2018) propose a visually grounded version of the textual entailment task, supported by the cognitive science view of enriching meaning representations with multiple modalities.…”
Section: Related Work
confidence: 99%
“…Second, we propose three strategies for retrieving a rich amount of cheap and allowably noisy supervision signals for inference and rationalization. Similar to Parcalabescu et al. (2021), PVLIR's three strategies rely on image captioning datasets (e.g., Changpinyo et al. (2021); Sharma et al. (2018); Gurari et al. (2020a); Lin et al. (2014a)) that are readily available as a result of years of research in the field and the maturity of its resources.…”
Section: Introduction
confidence: 99%