Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.156

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

Abstract: We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia.
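
The abstract describes building OSCAR from Common Crawl via language classification, filtering and cleaning. Below is a minimal Python sketch of that kind of language-identification filtering step, assuming fastText's publicly released lid.176.bin model; the target language, confidence threshold and file names are illustrative placeholders, not the authors' actual pipeline.

# Minimal sketch of fastText-based language-identification filtering,
# in the spirit of the corpus construction described in the abstract.
# Assumptions: the public lid.176.bin model is available locally and
# "crawl_lines.txt" holds one candidate line of crawled text per line.
import fasttext

LANG = "bg"        # illustrative target language (Bulgarian)
THRESHOLD = 0.80   # illustrative confidence cutoff, not the paper's value

model = fasttext.load_model("lid.176.bin")

def keep_line(line: str) -> bool:
    """Return True if fastText labels the line as the target language."""
    labels, probs = model.predict(line.replace("\n", " "))
    return labels[0] == f"__label__{LANG}" and probs[0] >= THRESHOLD

with open("crawl_lines.txt", encoding="utf-8") as src, \
     open(f"filtered_{LANG}.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if line and keep_line(line):
            dst.write(line + "\n")

The real OSCAR pipeline additionally deduplicates and cleans the classified text at Common Crawl scale; this fragment only illustrates the classify-and-filter idea.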

Cited by 66 publications (31 citation statements)
References 24 publications
“…We sample 10M EN-RU sentences from the WMT'19 shared task (Ma et al., 2019), and 80M RU sentences from the CoNLL'17 shared task to train embeddings. To simulate low-resource scenarios, we sample 10K, 100K and 1M UK sentences from the CoNLL'17 shared task and BE sentences from the OSCAR corpus (Ortiz Suárez et al., 2020). We use TED dev/test sets for both language pairs (Cettolo et al., 2012).…”
Section: Methods (citation type: mentioning)
confidence: 99%
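
As a concrete illustration of the sub-sampling this excerpt describes, the sketch below reservoir-samples a fixed number of sentences from a large monolingual text file; the 10K/100K/1M sizes mirror the quote, but the file names and the script itself are assumptions, not the cited authors' code.

# Sketch of sampling fixed-size subsets from a large monolingual corpus
# to simulate low-resource scenarios, as in the quoted setup.
import random

def sample_sentences(path: str, k: int, seed: int = 0) -> list:
    """Reservoir-sample k lines from a (possibly very large) text file."""
    rng = random.Random(seed)
    reservoir = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line.rstrip("\n"))
            else:
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = line.rstrip("\n")
    return reservoir

# Hypothetical input file; the quoted setup samples UK sentences from CoNLL'17.
for size in (10_000, 100_000, 1_000_000):
    subset = sample_sentences("uk_conll17_sentences.txt", size)
    with open(f"uk_sample_{size}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(subset) + "\n")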
“…We fine-tune three adaptations of BERT (Devlin et al., 2019): mBERT, trained by the original authors on a corpus consisting of the entire Wikipedia dumps of 100 languages; HeBERT (Chriqui and Yahav, 2021), trained on the OSCAR corpus (Ortiz Suárez et al., 2020) and Hebrew Wikipedia; and AlephBERT (Seker et al., 2021), also trained on the OSCAR corpus, with an additional 71.5 million tweets in Hebrew. All models are equivalent in size to BERT-base, i.e.…”
Section: Experiments Setup (citation type: mentioning)
confidence: 99%
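
For reference, a hedged sketch of loading the three BERT variants named above with the Hugging Face transformers library; the Hub identifiers for HeBERT and AlephBERT are assumptions based on their public releases and should be verified, and the sequence-classification head with two labels is purely illustrative of a fine-tuning setup, not the cited paper's task.

# Sketch: load mBERT, HeBERT and AlephBERT for fine-tuning.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_IDS = {
    "mBERT": "bert-base-multilingual-cased",
    "HeBERT": "avichr/heBERT",              # assumed Hub id for HeBERT
    "AlephBERT": "onlplab/alephbert-base",  # assumed Hub id for AlephBERT
}

def load_for_finetuning(hub_id: str, num_labels: int = 2):
    """Return (tokenizer, model) ready for task-specific fine-tuning."""
    tokenizer = AutoTokenizer.from_pretrained(hub_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        hub_id, num_labels=num_labels
    )
    return tokenizer, model

models = {name: load_for_finetuning(hub_id) for name, hub_id in MODEL_IDS.items()}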
“…NASCA was pretrained on the documents of the Catalan training set of the DACSA corpus (including some documents discarded in the corpora creation process [15]), the Catalan subset of the OSCAR corpus [31], and the 20 April 2021 dump of the Catalan Wikipedia. In total, 9.3 GB of raw text (2.5 million documents) were used to pretrain it.…”
Section: Summarization Models (citation type: mentioning)
confidence: 99%
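
A small sketch of how the OSCAR portion of such a pretraining corpus could be gathered with the Hugging Face datasets library; the unshuffled_deduplicated_ca config name is an assumption (and newer datasets versions may require trust_remote_code=True), while the DACSA documents and the specific 20 April 2021 Catalan Wikipedia dump would have to be added from their own sources.

# Sketch: stream the Catalan subset of OSCAR into a raw-text pretraining file.
from datasets import load_dataset

oscar_ca = load_dataset(
    "oscar", "unshuffled_deduplicated_ca", split="train", streaming=True
)

with open("catalan_pretraining_corpus.txt", "w", encoding="utf-8") as out:
    for record in oscar_ca:
        text = record["text"].strip()
        if text:
            out.write(text + "\n\n")  # blank line between documents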