Improving Zero-shot Cross-lingual Transfer between Closely Related Languages by injecting Character-level Noise

Aepli, Noëmi; Sennrich, Rico

doi:10.48550/arxiv.2109.06772

Cited by 1 publication

(1 citation statement)

References 15 publications

“…For example, our sample of target languages does not include any Indo-European languages, such as Germanic or Romance low-resource languages. These languages have been studied before and it has been shown that the best choice for them is transferring from a genealogically related rich-resource language (Aepli and Sennrich, 2021). It might be interesting to see how our proposed measure would compare with other measures in these cases, but this would require a different study design, which we leave for future work.…”

Section: Limitations (mentioning)

confidence: 99%

Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages

Pelloni, Shaitarova, Samardzic

2022

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Pre-trained multilingual models, such as mBERT, XLM-R and mT5, are used to improve the performance on various tasks in low-resource languages via cross-lingual transfer. In this framework, English is usually seen as the most natural choice for a transfer language (for fine-tuning or continued training of a multilingual pre-trained model), but it has been revealed recently that this is often not the best choice. The success of cross-lingual transfer seems to depend on some properties of languages, which are currently hard to explain. Successful transfer often happens between unrelated languages and it often cannot be explained by data-dependent factors. In this study, we show that languages written in non-Latin and non-alphabetic scripts (mostly Asian languages) are the best choices for improving performance on the task of Masked Language Modelling (MLM) in a diverse set of 30 low-resource languages and that the success of the transfer is well predicted by our novel measure of Subword Evenness (SuE). Transferring language models over the languages that score low on our measure results in the lowest average perplexity over target low-resource languages. Our correlation coefficients obtained with three different pre-trained multilingual models are consistently higher than all the other predictors, including text-based measures (type-token ratio, entropy) and linguistically motivated choice (genealogical and typological proximity).
