In recent years, the study of the intersection between the vision and language modalities, specifically in visual question answering (VQA) models, has gained significant attention due to its great potential in assistive applications for people with visual disabilities. Despite this, to date, many existing VQA models are not applicable to this goal for at least three reasons. First, they are designed to respond to a single, well-formed question; that is, they cannot give feedback on incomplete or incremental questions. Second, they consider only a single input image, which is assumed to be neither blurred, nor poorly focused, nor poorly framed. Both of these limitations are directly related to the loss of visual capacity: people with visual disabilities may have trouble interacting with a visual user interface to ask questions and to take adequate photographs. Third, these users frequently need to read text captured in the images, and most current VQA systems fall short in this task. This work presents a PhD proposal with four lines of research to be carried out until December 2025. It investigates techniques that increase the robustness of VQA models. In particular, we propose the integration of dialogue history, the analysis of more than one input image, and the incorporation of text recognition capabilities into the models. All of these contributions are motivated by the goal of assisting people with vision problems in their day-to-day tasks.