Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop 2020
DOI: 10.18653/v1/2020.acl-srw.34

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

Abstract: Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. The quality of the representations is ev…
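The abstract describes extracting multilingual sentence representations from a fine-tuned XLM. As an illustration only, the sketch below shows one common way to obtain such sentence embeddings from a publicly available XLM checkpoint by mean-pooling its final hidden states; the checkpoint name and pooling choice are assumptions, and the paper additionally fine-tunes the model on a synthetic parallel corpus produced by unsupervised MT before extracting embeddings.

```python
# Minimal sketch (not the authors' exact pipeline): sentence embeddings from
# a pretrained XLM via mean-pooling of the last hidden layer.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-mlm-100-1280"  # assumed stand-in: public 100-language XLM MLM checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def embed(sentences):
    """Mean-pool the last hidden states over non-padding tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return torch.nn.functional.normalize(emb, dim=-1)  # unit length for cosine similarity

# Example: cross-lingual similarity without any parallel supervision.
en = embed(["The cat sits on the mat."])
de = embed(["Die Katze sitzt auf der Matte."])
print(float(en @ de.T))  # cosine similarity of the two sentence embeddings
```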

Cited by 28 publications (45 citation statements)
References: 24 publications
“…
                              En-Es   Zh-Kz
(Hangya and Fraser, 2019)     18.5    21.6
(Keung et al., 2020)          20.2    22.8
(Hangya et al., 2018)         16.3    19.3
(Kvapilíková et al., 2020)    23.6    22.7
Proposed method               24.3    25.8

We use OpenNMT to train the machine translation system. The results are as in Table 3.…”
Section: Methods (citation type: mentioning)
confidence: 99%
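The quoted table compares parallel corpus mining results across methods, presumably as F1-style retrieval scores against a gold alignment; the exact metric is not stated in the excerpt. As a purely illustrative sketch under that assumption, the snippet below computes precision, recall, and F1 for a set of mined sentence pairs; `mined` and `gold` are hypothetical sets of (source_id, target_id) tuples.

```python
# Sketch only: BUCC-style scoring of mined sentence pairs against gold pairs.
def mining_scores(mined: set, gold: set) -> dict:
    true_pos = len(mined & gold)
    precision = true_pos / len(mined) if mined else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example with hypothetical sentence IDs:
print(mining_scores({("en-1", "es-7"), ("en-2", "es-9")},
                    {("en-1", "es-7"), ("en-3", "es-2")}))
```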
“…This transfer learning method inspired our work; the main difference is that they required bilingual supervision (e.g., a bilingual lexicon or parallel sentences), which is not available for many low-resource language pairs. Recently, several works have developed unsupervised methods to mine parallel data (Hangya et al., 2018; Hangya and Fraser, 2019; Kvapilíková et al., 2020; Keung et al., 2020). These approaches mainly rely on unsupervised cross-lingual embeddings (Artetxe et al., 2018; Lample and Conneau, 2019) that can be trained on monolingual corpora.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
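This quote describes mining parallel data with unsupervised cross-lingual sentence embeddings. The sketch below illustrates one widely used retrieval scheme in this line of work, margin-based (ratio) scoring over nearest neighbours in the spirit of Artetxe and Schwenk (2019); it is a simplified sketch, not the cited papers' exact procedure, and the neighbourhood size `k` and `threshold` are hypothetical values.

```python
# Sketch: margin-based parallel sentence mining over L2-normalised embeddings.
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin score between every source and every target sentence."""
    sim = src_emb @ tgt_emb.T                              # cosine similarities, shape (S, T)
    knn_src = -np.sort(-sim, axis=1)[:, :k].mean(axis=1)   # mean sim to each source's k NNs, (S,)
    knn_tgt = -np.sort(-sim, axis=0)[:k, :].mean(axis=0)   # mean sim to each target's k NNs, (T,)
    return sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2.0)

def mine_pairs(src_emb, tgt_emb, threshold=1.05):
    """Forward mining: keep each source's best target if its margin clears the threshold."""
    margin = margin_scores(src_emb, tgt_emb)
    best = margin.argmax(axis=1)
    return [(i, int(j)) for i, j in enumerate(best) if margin[i, j] >= threshold]
```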
“…Baselines: In our experiments, we consider supervised baselines (Bouamor and Sajjad, 2018; Schwenk, 2018; Artetxe and Schwenk, 2019). We also compare against several unsupervised baselines (Hangya and Fraser, 2019; Keung et al., 2020; Hangya et al., 2018; Kvapilíková et al., 2020).…”
Section: En-Fr (citation type: mentioning)
confidence: 99%
“…However, their method is not unsupervised and relies on bilingual supervision (e.g., a bilingual lexicon or parallel sentences), which is not available for low-resource language pairs. Although Kvapilíková et al. (2020) removed this supervision requirement by employing unsupervised MT, the performance depends heavily on the quality of the MT system.…”
Section: Introduction (citation type: mentioning)
confidence: 99%