Findings of the Association for Computational Linguistics: EACL 2023
DOI: 10.18653/v1/2023.findings-eacl.93

Mini But Mighty: Efficient Multilingual Pretraining with Linguistically-Informed Data Selection

Tolulope Ogunremi,
Dan Jurafsky,
Christopher Manning

Abstract: With the prominence of large pretrained language models, low-resource languages are rarely modelled monolingually and become victims of the "curse of multilinguality" in massively multilingual models. Recently, AfriBERTa showed that training transformer models from scratch on 1GB of data from many unrelated African languages outperforms massively multilingual models on downstream NLP tasks. Here we extend this direction, focusing on the use of related languages. We propose that training on smaller amounts of …

Cited by 3 publications (2 citation statements)
References 14 publications
“…However, Wu and Dredze (2020) show that these massively multilingual models still underperform on lower-resource languages. Recent efforts to cover these languages instead pre-train models that are specialized to specific languages or language families (Ogueji et al., 2021; Ogunremi et al., 2023). These approaches nonetheless require training a new model from scratch and do not leverage transferable information in existing models.…”
Section: Introduction
confidence: 99%
“…Our work builds off of a long literature on multilingual evaluation which has until now mostly focused on downstream classification tasks (Conneau et al., 2018; Ebrahimi et al., 2022; Clark et al., 2020; Liang et al., 2020; Hu et al., 2020; Raganato et al., 2020; Li et al., 2021). With the help of these evaluation methods, research has pointed out the problems for both high- and low-resource languages that come with adding many languages to a single model (Wang et al., 2020; Turc et al., 2021; Lauscher et al., 2020, inter alia), and proposed methods for more equitable models (Ansell et al., 2022; Pfeiffer et al., 2022; Ogueji et al., 2021; Ògúnrèmí and Manning, 2023; Virtanen et al., 2019; Liang et al., 2023, inter alia).…”
Section: Introduction
confidence: 99%