Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.747
Unsupervised Cross-lingual Representation Learning at Scale

Abstract: This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. …
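The released XLM-R checkpoints are most commonly accessed through the Hugging Face transformers library. As an illustration (not part of the paper itself), the sketch below loads the publicly available base checkpoint xlm-roberta-base and fills in a masked token; it assumes transformers and torch are installed.

```python
# Minimal sketch: querying the released XLM-R checkpoint as a masked language model
# via the Hugging Face transformers library. Illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.eval()

# The same model covers all of its ~100 pretraining languages.
text = f"Paris is the capital of {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Take the most likely token at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```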

Cited by 2,945 publications (3,062 citation statements); references 43 publications.
“…Initially, these joint models were trained using explicit supervision from sentence aligned data [31], but later it was discovered that merely training with a language modeling objective on a concatenation of raw corpora from multiple languages can yield multilingual representations [12,32]. This approach was later extended by incorporating more pretraining tasks [33,34] and even learning a hundred languages using a single model [13]. While these massively multilingual language models are effective at increasing the sample efficiency in low-resource languages, they are prohibitively expensive to train since the training cost increases linearly with the size of the data in use.…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
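The excerpt above describes the core recipe behind mBERT-style and XLM-R-style pretraining: a single masked-language-modeling objective over raw text pooled from many languages, with no alignment supervision. The sketch below is a hedged illustration of that idea only; the toy corpora and masking details are simplified assumptions and omit XLM-R's language rebalancing and the usual 80/10/10 token-replacement split.

```python
# Hedged sketch: one masked-LM objective over a single stream built by
# concatenating raw corpora from several languages. Illustrative, not the XLM-R recipe.
import random
import torch

def concatenate_corpora(corpora):
    """Pool raw lines from every language into one unlabeled training stream."""
    stream = [line for corpus in corpora for line in corpus]
    random.shuffle(stream)
    return stream

def mask_for_mlm(token_ids, mask_id, p=0.15):
    """BERT-style masking: ~15% of tokens become prediction targets."""
    token_ids = token_ids.clone()
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < p
    labels[~masked] = -100       # unmasked positions are ignored by the cross-entropy loss
    token_ids[masked] = mask_id  # (the 80/10/10 replacement split is omitted here)
    return token_ids, labels

# English and French lines share the same objective and the same model parameters.
stream = concatenate_corpora([["the cat sat on the mat"], ["le chat dort sur le tapis"]])
```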
“…While these massively multilingual language models are effective at increasing the sample efficiency in low-resource languages, they are prohibitively expensive to train since the training cost increases linearly with the size of the data in use. Further, learning from many languages requires the model to have higher capacity [13]. This leads to difficulties when trying to adapt this method to more efficient and capable architectures or deploy to devices with limited computing resources.…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
“…Note that our suggested methods and our analyses in Section 5 do not relate to the size of the model (i.e., the hidden size or the number of layers). We strongly believe that our methods can be applied to larger language models such as BERT [2] or XLM-R [33], because they also exploit the multi-head attention as the same way as the Transformer model in our experiments.…”
Section: Quantitative Evaluation (citation type: mentioning)
Confidence: 99%
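The excerpt argues that its method carries over to BERT [2] and XLM-R [33] because all of them use the same multi-head attention as the original Transformer. For reference, a minimal PyTorch sketch of multi-head self-attention is given below; the dimensions (768 hidden units, 12 heads) match the base configurations, but the code is an illustration, not the implementation used by any of these models.

```python
# Minimal sketch of multi-head self-attention, the mechanism shared by the
# Transformer, BERT, and XLM-R. Dimensions are illustrative.
import torch
import torch.nn.functional as F
from torch import nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint query/key/value projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                            # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the model dimension into independent heads.
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)

# Usage: a batch of 2 sequences of length 5.
x = torch.randn(2, 5, 768)
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([2, 5, 768])
```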