There is a long tradition of grounding language understanding in single images, in the form of visual question answering (Goyal et al., 2017; Hudson and Manning, 2019), visual dialogue (de Vries et al., 2017; Das et al., 2017), or visual entailment (Xie et al., 2019). Recently, increasing attention has been directed to settings where the visual context consists of multiple images, either conventional static pictures (Vedantam et al., 2017; Hu et al., 2019; Suhr et al., 2019; Forbes et al., 2019; Hendricks and Nematzadeh, 2021; Yan et al., 2021; Hosseinzadeh and Wang, 2021; Bogin et al., 2021; Liu et al., 2021) or video frames (Jhamtani and Berg-Kirkpatrick, 2018a; Bansal et al., 2020). While many of these benchmarks involve just two images, COVR (Bogin et al., 2021) and ISVQA (Bansal et al., 2020) provide larger image sets, similar to our sets of 10 images.