Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 2021
DOI: 10.18653/v1/2021.eacl-main.115

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

Abstract: We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages. We systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus is freely available. To get an indication on the quality of the extracted bitexts, …

Cited by 91 publications (72 citation statements)
References 30 publications
“…The most recent initiative is the so-called LASER (Artetxe and Schwenk, 2019b), which relies on vector representations of sentences to extract similar pairs. This toolkit has been used to extract the WikiMatrix corpus (Schwenk et al, 2019) which contains 135 million parallel sentences for 1,620 different language pairs in 85 different languages.…”
Section: Related Work (mentioning)
confidence: 99%
“…The corpus alignment module makes use of the text and dictionaries retrieved in the previous step and the LASER toolkit. LASER (Language-Agnostic SEntence Representations) allows to obtain sentence embeddings through a multilingual sentence encoder (Schwenk et al, 2019). Translations can be found then as close pairs (tuples) in the multilingual semantic space.…”
Section: Base Architecture (mentioning)
confidence: 99%
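The idea in the statement above, that translations show up as close pairs in a shared embedding space, can be illustrated with a minimal sketch. It assumes sentence embeddings are already available as NumPy arrays; the toy vectors below stand in for encoder output, the function name `mine_pairs` and the threshold value are hypothetical, and the mutual-nearest-neighbour rule with a cosine threshold is a simplification chosen for illustration, not necessarily the criterion used to build WikiMatrix.

```python
import numpy as np

def mine_pairs(src_emb, tgt_emb, threshold=0.8):
    """Return (src_idx, tgt_idx, cosine) for mutual nearest neighbours.

    Hypothetical helper: a pair is kept only if each sentence is the
    other's closest match and their cosine similarity reaches `threshold`.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                      # shape: (n_src, n_tgt)
    best_tgt = sim.argmax(axis=1)          # closest target for each source
    best_src = sim.argmax(axis=0)          # closest source for each target
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sim[i, j] >= threshold:
            pairs.append((i, int(j), float(sim[i, j])))
    return pairs

# Toy embeddings (assumed, 3 dimensions) in place of real encoder output.
src = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tgt = np.array([[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]])
pairs = mine_pairs(src, tgt)
```

With these toy vectors, the first two source sentences each find a mutual match, while the third has no sufficiently close counterpart and is left unpaired.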
“…Curating such datasets relies on the Web sites giving clues about the language of their contents (e.g., a language identifier in the URL) and on automatic language classification (LangID). It is commonly known that these automatically crawled and filtered datasets tend to have overall lower quality than hand-curated collections, but their quality is rarely measured directly, and is rather judged through the improvements they bring to downstream applications (Schwenk et al, 2021).…”
Section: Introduction (mentioning)
confidence: 99%
“…To shed light on the quality of data crawls for the lowest resource languages, we perform a manual data audit for 230 per-language subsets of five major crawled multilingual datasets: CCAligned (El-Kishky et al, 2020), ParaCrawl (Esplà et al, 2019; Bañón et al, 2020), WikiMatrix (Schwenk et al, 2021), OSCAR (Ortiz Suárez et al, 2019), and mC4 (Xue et al, 2021). We propose solutions for effective, low-effort data auditing (Section 4), including an error taxonomy.…”
Section: Introduction (mentioning)
confidence: 99%