2021
DOI: 10.48550/arxiv.2109.06082
Preprint

xGQA: Cross-Lingual Visual Question Answering

Abstract: Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer modeling efforts. In this work, we address this gap and provide xGQA, a new multilingual evaluation benchmark for the visual question answering task. We extend the established English GQA dataset (Hudson and Manning, 2019) to 7 typologically diverse languages, enabling us to detect and explore crucial challenges in cross-lingual visual question answering.

Cited by 1 publication (7 citation statements)
References 32 publications
“…Consequently, models trained on such datasets do not take into account linguistic diversity (Ponti et al., 2020) or cross-cultural nuances (Liu et al., 2021). The need to expand V&L research towards more languages has been recognised by 1) the recent creation of multilingual training and evaluation data across diverse V&L tasks and languages (Srinivasan et al., 2021; Su et al., 2021; Pfeiffer et al., 2021; Liu et al., 2021; Wang et al., 2021, inter alia), as well as 2) the emergence of the first large multilingual-multimodal pretrained models (Ni et al., 2021; Zhou et al., 2021; Liu et al., 2021) and monolingual V&L models adapted to multiple languages (Chen et al., 2020; Pfeiffer et al., 2021). In this work, we merge and expand on these two research threads, aiming to highlight current achievements and challenges in this area and to facilitate comparative evaluations, thus bringing together the abovementioned collective research efforts.…”
Section: Related Work and Motivation
confidence: 99%
“…We combine the text-only dataset SNLI (Bowman et al., 2015) with its multimodal counterpart (Xie et al., 2019) (Krishna et al., 2017). In particular, we use the English balanced training set to train our models, and evaluate on the few-shot evaluation sets defined by Pfeiffer et al. (2021) to allow for direct comparison between zero-shot and few-shot experiments.…”
Section: xVNLI
confidence: 99%