Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
DOI: 10.18653/v1/2021.acl-long.105
Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study

Abstract: Recent research in multilingual language models (LM) has demonstrated their ability to effectively handle multiple languages in a single model. This holds promise for low web-resource languages (LRL) as multilingual models can enable transfer of supervision from high resource languages to LRLs. However, incorporating a new language in an LM still remains a challenge, particularly for languages with limited corpora and in unseen scripts. In this paper we argue that relatedness among languages in a language fami…

Cited by 25 publications (26 citation statements) · References 24 publications
“…(Amrhein and Sennrich, 2020) studied how transliteration improved NMT and came to the conclusion that transliteration offered significant improvement for low-resource languages with different scripts. (Khemchandani et al., 2021) showed on Indo-Aryan languages that language relatedness could be exploited through transliteration, along with bilingual lexicon-based pseudo-translation and aligned loss, to incorporate low-resource languages into pretrained mBERT. (Muller et al., 2021a) showed that for unseen languages, the script barrier hindered transfer between low-resource and high-resource languages for MLLMs and transliteration removed this barrier.…”
Section: Motivation and Background
confidence: 99%
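For readers unfamiliar with the script-unification step these statements refer to, a minimal sketch follows. It relies on the fact that the major Indic Unicode blocks are laid out in parallel 128-codepoint ranges, so rule-based transliteration between two scripts can be approximated by a fixed codepoint offset; the Bengali-to-Devanagari function and example below are illustrative assumptions, not the cited systems, which use dedicated transliteration tools and handle script-specific exceptions.

```python
# Minimal sketch: map Bengali text to Devanagari by exploiting the aligned
# Unicode layout of Indic scripts (Devanagari starts at U+0900, Bengali at
# U+0980), so most characters differ by a constant codepoint offset.
DEVANAGARI_START = 0x0900
BENGALI_START = 0x0980

def bengali_to_devanagari(text):
    out = []
    for ch in text:
        cp = ord(ch)
        if BENGALI_START <= cp < BENGALI_START + 0x80:
            # Shift the codepoint into the Devanagari block.
            out.append(chr(cp - BENGALI_START + DEVANAGARI_START))
        else:
            # Leave digits, punctuation, Latin text, etc. untouched.
            out.append(ch)
    return "".join(out)

# The Bengali word for "language" maps onto its Devanagari cognate.
print(bengali_to_devanagari("ভাষা"))  # -> "भाषा"
```

Representing the LRL and a related HRL in one script in this way is what creates the lexical overlap that the quoted works report as the main driver of improved transfer.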
“…Input Data In the data creation stage, Conneau et al. (2020a) propose over-sampling of LRL documents to improve LRL representation in the vocabulary and pre-training steps. Khemchandani et al. (2021) specifically target related languages and propose transliteration of LRL documents to the script of a related HRL for greater lexical overlap. We deploy both these tricks in this paper.…”
Section: Related Work
confidence: 99%
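The over-sampling mentioned above is typically implemented as exponentially smoothed sampling over per-language corpus sizes. The sketch below shows that computation; the corpus counts are hypothetical and alpha = 0.3 follows common practice rather than anything stated in the quoted text.

```python
# Minimal sketch of exponentially smoothed language sampling
# (as in Conneau et al., 2020): raise each language's corpus share
# to the power alpha < 1, then renormalize, which up-weights
# low web-resource languages during pre-training.
def sampling_probs(corpus_sizes, alpha=0.3):
    total = sum(corpus_sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in corpus_sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical sentence counts per language.
corpus_sizes = {"hi": 10_000_000, "mr": 1_000_000, "bho": 50_000}
print(sampling_probs(corpus_sizes))
# With alpha < 1, "bho" receives a far larger share than its raw proportion.
```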
“…(Maronikolakis et al., 2021) targets tokenization compatibility based purely on vocabulary size and does not focus on choosing the tokens that go in the vocabulary. Pre-Training and Adaptation Several previous works have proposed to include an additional alignment loss between parallel or pseudo-parallel (Khemchandani et al., 2021) sentences to co-embed HRLs and LRLs. Another approach is to design language-specific Adapter layers (Pfeiffer et al., 2020a,b; Artetxe et al., 2020; Üstün et al., 2020) that can be easily fine-tuned for each new language.…”
Section: Related Work
confidence: 99%
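A minimal sketch of one way such an alignment loss over (pseudo-)parallel sentence pairs can look is given below: mean-pool the encoder's contextual embeddings of an HRL sentence and its LRL (pseudo-)translation and penalize their cosine distance. The pooling and loss form here are illustrative assumptions; the exact objective used by Khemchandani et al. (2021) may differ.

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden, attention_mask):
    # Average token embeddings of shape (batch, seq_len, hidden),
    # ignoring padding positions indicated by attention_mask.
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def alignment_loss(hidden_hrl, mask_hrl, hidden_lrl, mask_lrl):
    # Pull the pooled representations of a (pseudo-)parallel pair together:
    # 1 - cosine similarity, averaged over the batch. Added on top of the
    # usual masked-LM loss during adaptation.
    src = mean_pool(hidden_hrl, mask_hrl)
    tgt = mean_pool(hidden_lrl, mask_lrl)
    return (1.0 - F.cosine_similarity(src, tgt, dim=-1)).mean()
```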
“…Previous work has shown the importance of language relatedness for cross-lingual transfer (Aharoni et al., 2019; Kudugunta et al., 2019). In the case of Indic languages, orthographic similarity between Indic languages has been utilized to represent data in a common script and improve cross-lingual transfer (Dabre et al., 2018; Goyal et al., 2020b; Khemchandani et al., 2021).…”
Section: Related Work
confidence: 99%