Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021)
DOI: 10.18653/v1/2021.emnlp-main.818
Visually Grounded Reasoning across Languages and Cultures

Cited by 56 publications (83 citation statements)
References 60 publications
“…Visio-linguistic stress testing. There are a number of existing multimodal stress tests about correctly understanding implausible scenes [13], exploitation of language and vision priors [11,27], single word mismatches [64], hate speech detection [26,32,41,92], memes [39,75], ablation of one modality to probe the other [22], distracting models with visual similarity between images [7,33], distracting models with textual similarity between many suitable captions [1,17], collecting more diverse image-caption pairs beyond the predominately English and North American/Western European datasets [50], probing for an understanding of verb-argument relationships [30], counting [53], or specific model failure modes [65,69]. Many of these stress tests rely only on synthetically generated images, often with minimal visual differences, but no correspondingly minimal textual changes [80].…”
Section: Related Work (mentioning)
confidence: 99%
“…Winoground is English-only and translation to other languages may be nontrivial [50]. Expert curation is time-consuming and our dataset is limited in size.…”
Section: Perplexity (mentioning)
confidence: 99%
“…Only recently have multilingual multimodal benchmarks been developed (Srinivasan et al., 2021; Liu et al., 2021b; Pfeiffer et al., 2021; Bugliarello et al., 2022, inter alia) making it possible to evaluate multimodal models which have either been pretrained on multilingual data (Ni et al., 2021; Zhou et al., 2021) or extended to unseen languages (Liu et al., 2021b; Pfeiffer et al., 2021).…”
Section: Related Work (mentioning)
confidence: 99%
“…The lack of multilingual resources has hindered the development and evaluation of Visual Question Answering (VQA) methods beyond the English language. Only recently, there has been a rise of interest in creating multilingual Vision-and-Language (V&L) resources which have also inspired more research in this area (Srinivasan et al., 2021; Liu et al., 2021b; Pfeiffer et al., 2021; Bugliarello et al., 2022, inter alia). Large Transformer-based models pretrained on images and text in multiple different languages have been proven as a viable vehicle for the development of multilingual V&L task architectures through transfer learning, but such models are still few and far between (M3P, UC2; Ni et al., 2021; Zhou et al., 2021).…”
Section: Introduction (mentioning)
confidence: 99%
“…There is a long tradition of grounding language understanding on single images, in the form of visual question answering (Goyal et al., 2017; Hudson and Manning, 2019), visual dialogue (de Vries et al., 2017; Das et al., 2017), or visual entailment (Xie et al., 2019). Recently, more and more focus has been directed to settings where the visual context consists of multiple images, either conventional static pictures (Vedantam et al., 2017; Hu et al., 2019; Suhr et al., 2019; Forbes et al., 2019; Hendricks and Nematzadeh, 2021; Yan et al., 2021; Hosseinzadeh and Wang, 2021; Bogin et al., 2021; Liu et al., 2021), or video frames (Jhamtani and Berg-Kirkpatrick, 2018a; Bansal et al., 2020). While many of these benchmarks involve just two images, COVR (Bogin et al., 2021) and ISVQA (Bansal et al., 2020) provide more images, similar to our sets of 10 images.…”
Section: Related Work (mentioning)
confidence: 99%