2021
DOI: 10.48550/arxiv.2104.05596
Preprint

Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Abstract: We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 46.9 million sentence pairs between English and 11 Indic languages (from two language families). In particular, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and we additionally mine 34.6 million sentence pairs from the web, resulting in a 2.8× increase in publicly available sentence pairs. We mine the parallel sentences from the w…

Citations: cited by 8 publications (7 citation statements)
References: 26 publications
“…MT systems are generally modeled as monolingual or multilingual models based on the (i) availability of large training corpus (ii) similarities between languages. [5,73,84].…”
Section: Machine Translation
Type: mentioning
confidence: 99%
“…Datasets We will consider a total of 51 English-centric language pairs. We gather this data from WMT, the OPUS 100 (Zhang et al, 2020) dataset, Paracrawl and Samanantar (Ramesh et al, 2021).…”
Section: Model Configurations
Type: mentioning
confidence: 99%
“…Weakly Labeled Data Generation: We propose an annotation-projection-based approach to create weakly labeled data from a large bi-lingual parallel corpus (Ramesh et al 2021). Let $\{(S_i, T_i)\}_{i=1}^{N}$ having $N$ sentences be a parallel corpus of English and a low-resource language, where $(S_i, T_i)$ represents a pair of sentences which are translations of each other.…”
Section: Proposed Approach
Type: mentioning
confidence: 99%