Mini But Mighty: Efficient Multilingual Pretraining with Linguistically-Informed Data Selection

Ogunremi, Tolulope; Jurafsky, Dan; Manning, Christopher

doi:10.18653/v1/2023.findings-eacl.93

Cited by 3 publications

(2 citation statements)

References 14 publications

“…However, Wu and Dredze (2020) show that these massively multilingual models still underperform on lower-resource languages. Recent efforts to cover these languages instead pre-train models that are specialized to specific languages or language families (Ogueji et al, 2021; Ogunremi et al, 2023). These approaches nonetheless require training a new model from scratch and do not leverage transferable information in existing models.…”

Section: Introduction (mentioning)

confidence: 99%

Embedding Structure Matters: Comparing Methods to Adapt Multilingual Vocabularies to New Languages

Downey, Blevins, Goldfine et al. 2023

Proceedings of the 3rd Workshop on Multi-Lingual Representation Learning (MRL)

Pre-trained multilingual language models underpin a large portion of modern NLP tools outside of English. A strong baseline for specializing these models for specific languages is Language-Adaptive Pre-Training (LAPT). However, retaining a large cross-lingual vocabulary and embedding matrix comes at considerable excess computational cost during adaptation. In this study, we propose several simple techniques to replace a cross-lingual vocabulary with a compact, language-specific one. Namely, we address strategies for re-initializing the token embedding matrix after vocabulary specialization. We then provide a systematic experimental comparison of our techniques, in addition to the recently-proposed FOCUS method. We demonstrate that: 1) Embedding-replacement techniques in the monolingual transfer literature are inadequate for adapting multilingual models. 2) Replacing cross-lingual vocabularies with smaller specialized ones provides an efficient method to improve performance in low-resource languages. 3) Simple embedding re-initialization techniques based on script-wise sub-distributions rival techniques such as FOCUS, which rely on similarity scores obtained from an auxiliary model.

“…Our work builds off of a long literature on multilingual evaluation which has until now mostly focused on downstream classification tasks (Conneau et al, 2018; Ebrahimi et al, 2022; Clark et al, 2020; Liang et al, 2020; Hu et al, 2020; Raganato et al, 2020; Li et al, 2021). With the help of these evaluation methods, research has pointed out the problems for both high- and low-resource languages that come with adding many languages to a single model (Wang et al, 2020; Turc et al, 2021; Lauscher et al, 2020, inter alia), and proposed methods for more equitable models (Ansell et al, 2022; Pfeiffer et al, 2022; Ogueji et al, 2021; Ògúnrẹ̀mí and Manning, 2023; Virtanen et al, 2019; Liang et al, 2023, inter alia).…”

Section: Introduction (mentioning)

confidence: 99%

Multilingual BERT has an Accent: Evaluating English Influences on Fluency in Multilingual Models

Papadimitriou, Lopez, Jurafsky 2023

Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Advancements in Natural Language Understanding-Driven Machine Translation: Focus on English and the Low Resource Dialectal Lusoga

Wasike, Kamukama, Abass Aleshinloye et al. 2024

International Journal of Innovative Science and Research Techno

This review explores recent advancements in Natural Language Understanding-driven Machine Translation (NLU-MT) with a focus on English and the low-resource dialectal Lusoga. A low-resource language such as Lusoga faces significant challenges in Machine Translation (MT) due to the scarcity of high-quality parallel corpora, the complex morphology inherent in Bantu languages, and the dialectal variations within Lusoga itself, particularly between Lutenga and Lupakoyo. This paper examines the role of NLU-based MT systems in overcoming these challenges by shifting from word-for-word mapping to meaning-based translations, enabling better handling of these dialectal differences. We highlight the success of leveraging linguistic similarities between Lusoga and related languages, such as Luganda, to improve translation performance through multilingual transfer learning techniques. Key advancements include the use of transformer-based architectures such as Multilingual Bidirectional and Auto-Regressive Transformer (mBART) and Multilingual Text-To-Text Transfer Transformer (mT5), specifically selected for their effectiveness in NLU-driven contexts, which have shown promise in enhancing translation accuracy for African low-resource languages. However, the review also identifies ongoing obstacles, including historical low demand and the lack of well-developed corpora, which hinder scalability. The paper concludes by emphasizing the potential of hybrid approaches that combine community-driven corpus-building initiatives with improved model architectures to drive further progress in low-resource MT. Ultimately, NLU-MT is positioned as a crucial tool not only for bridging communication gaps but also for preserving linguistic diversity and cultural heritage.
