“…Consequently, models trained on such datasets do not take into account linguistic diversity (Ponti et al., 2020) or cross-cultural nuances (Liu et al., 2021). The need to expand V&L research towards more languages has been recognised by 1) the recent creation of multilingual training and evaluation data across diverse V&L tasks and languages (Srinivasan et al., 2021; Su et al., 2021; Pfeiffer et al., 2021; Liu et al., 2021; Wang et al., 2021, inter alia), as well as 2) the emergence of the first large multilingual-multimodal pretrained models (Ni et al., 2021; Zhou et al., 2021; Liu et al., 2021) and monolingual V&L models adapted to multiple languages (Chen et al., 2020; Pfeiffer et al., 2021). In this work, we merge and expand on these two research threads, aiming to highlight current achievements and challenges in this area and to facilitate comparative evaluations, thus bringing together the abovementioned collective research efforts.…”