2021
DOI: 10.48550/arxiv.2109.07301
Preprint

What Vision-Language Models 'See' when they See Scenes

Michele Cafagna, Kees van Deemter, Albert Gatt

Abstract: Images can be described in terms of the objects they contain, or in terms of the types of scene or place that they instantiate. In this paper we address to what extent pretrained Vision and Language models can learn to align descriptions of both types with images. We compare 3 state-of-the-art models, VisualBERT, LXMERT and CLIP. We find that (i) V&L models are susceptible to stylistic biases acquired during pretraining; (ii) only CLIP performs consistently well on both object- and scene-level descriptions. A f…
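
As an illustration of the image-sentence alignment setting studied in the paper, the sketch below scores an object-level and a scene-level caption against the same image with CLIP via the Hugging Face transformers API. The checkpoint name, image file, and captions are hypothetical examples, not the paper's actual experimental setup.

# A minimal sketch of zero-shot image-sentence alignment with CLIP,
# contrasting an object-level and a scene-level description of one image.
# Checkpoint, image path, and captions are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("kitchen.jpg")  # hypothetical input image
captions = [
    "a person chopping vegetables on a counter",  # object-level description
    "a photo taken in a kitchen",                 # scene-level description
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; softmax turns
# them into a distribution over the candidate descriptions.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")

Comparable scores for the two captions would suggest the model aligns scene-level descriptions with the image as readily as object-level ones, which is the kind of behaviour the paper probes across VisualBERT, LXMERT and CLIP.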

Cited by 4 publications (4 citation statements) | References 32 publications

Citation statements:
“…To the best of our knowledge, scene context has not yet been integrated into visual REG models (but see Cafagna et al. 2021 for the inverse task of grounding descriptions in images). However, it is safe to assume that leveraging this kind of context is a challenging modeling problem in itself.…”
Section: Toward a Wider Notion of Context in Visual REG
Mentioning, confidence: 99%
“…For our evaluation, we choose CLIP, as opposed to other language-vision models, due to its vast popularity as a foundation model (Bommasani et al., 2022), i.e., its use in a multitude of models and its impressive zero-shot performance across various tasks and datasets, e.g., text-to-image retrieval, image question answering, human action segmentation, and image-sentence alignment (Cafagna et al., 2021). However, we observe these datasets contain mostly images from North America and Western Europe, and, to the best of our knowledge, we are the first to evaluate CLIP on more diverse data.…”
Section: State-of-the-art Vision-Language Model
Mentioning, confidence: 99%
“…Multimodal Grounding: HL Dataset is also a useful resource to benchmark the grounding capabilities of large pre-trained V&L models. Along these lines, Cafagna et al. (2021) study the capability of V&L models to understand scene descriptions in zero-shot settings, finding that only large-scale pre-trained V&L models have enough generalization capabilities to handle unseen high-level scene descriptions. analyse the impact of exposure to high-level scene descriptions on multimodal representations in models pretrained on object-centric captions.…”
Section: Further Uses of the HL Dataset
Mentioning, confidence: 99%