“…Our work builds on a long literature on multilingual evaluation, which has until now mostly focused on downstream classification tasks (Conneau et al., 2018; Ebrahimi et al., 2022; Clark et al., 2020; Liang et al., 2020; Hu et al., 2020; Raganato et al., 2020; Li et al., 2021). With the help of these evaluation methods, research has pointed out the problems for both high- and low-resource languages that come with adding many languages to a single model (Wang et al., 2020; Turc et al., 2021; Lauscher et al., 2020, inter alia), and proposed methods for more equitable models (Ansell et al., 2022; Pfeiffer et al., 2022; Ogueji et al., 2021; Ògúnrẹ̀mí and Manning, 2023; Virtanen et al., 2019; Liang et al., 2023, inter alia).…”