2022
DOI: 10.1162/tacl_a_00452

Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Abstract: We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and additionally mine 37.4 million sentence pairs from the Web, resulting in a 4× increase. We mine the parallel sentences from the Web by combining many corpora, tools, and m…
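The abstract is truncated before it names the mining pipeline, so the following is an illustration only: a minimal sketch of web-scale parallel-sentence mining with multilingual sentence embeddings and nearest-neighbour search scored by a margin criterion. The choice of LaBSE, FAISS, the k=4 neighbourhood, and the 1.05 threshold are assumptions made for this example (a simplified, one-directional variant of ratio-margin scoring), not details taken from this page.

```python
# Hedged sketch: embedding-based parallel sentence mining.
# Assumptions (not from this page): LaBSE as the sentence encoder,
# FAISS inner-product search, and a one-directional ratio-margin score.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def mine_pairs(en_sents, indic_sents, k=4, threshold=1.05):
    """Return (score, en, indic) candidates whose margin score exceeds threshold."""
    # Normalize embeddings so that inner product equals cosine similarity.
    en_vecs = encoder.encode(en_sents, normalize_embeddings=True).astype(np.float32)
    in_vecs = encoder.encode(indic_sents, normalize_embeddings=True).astype(np.float32)

    # Index the Indic side and query with the English side.
    index = faiss.IndexFlatIP(in_vecs.shape[1])
    index.add(in_vecs)
    sims, ids = index.search(en_vecs, k)

    pairs = []
    for i, (row_sims, row_ids) in enumerate(zip(sims, ids)):
        # Ratio margin: best similarity divided by the mean similarity of the
        # k-nearest neighbourhood, which penalizes generic "hub" sentences.
        margin = row_sims[0] / (row_sims.mean() + 1e-9)
        if margin >= threshold:
            pairs.append((float(margin), en_sents[i], indic_sents[row_ids[0]]))
    return sorted(pairs, reverse=True)
```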

Cited by 68 publications (26 citation statements) | References 36 publications
“…In set I, parallel corpora were collected from Samanantar, a collection of the largest parallel corpora available for Indic languages (Ramesh et al., 2022); statistics of set I are shown in Table 1. It may be noted that only a small portion is used in this task instead of the whole dataset.…”
Section: Dataset
Mentioning confidence: 99%
“…Thus, fact verification in a bilingual setting, wherein the premise is in English and the claim/hypothesis is in an Indic language, is of great significance. Moreover, recent advances in multilingual language models (Khanuja et al., 2021a; Kunchukuttan, 2020), datasets (Roark et al., 2020; Ramesh et al., 2022), and translation systems (Ramesh et al., 2022) for Indian languages have enabled quality examination of several Indic NLU tasks, which serves as additional motivation to evaluate the bTNLI task for Indic languages before other low-resource languages.…”
Section: Motivation
Mentioning confidence: 99%
“…To construct EI-INFOTABS, we machine translated the English hypotheses provided in INFOTABS into 11 major Indian languages as described earlier. We use IndicTrans (Ramesh et al., 2022), an open-source, state-of-the-art Indic NMT model. IndicTrans is trained on the Samanantar dataset (Ramesh et al., 2022), which is the largest publicly available parallel corpus for Indic languages.…”
Section: EI-INFOTABS Construction
Mentioning confidence: 99%
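As orientation for the construction step quoted above, here is a minimal sketch of batch-translating English hypotheses with a generic Hugging Face seq2seq checkpoint. The model identifier below is a placeholder, not a real checkpoint name; the cited work uses IndicTrans, whose own release ships the required preprocessing (script unification, SentencePiece) and inference scripts.

```python
# Hedged sketch, not the IndicTrans pipeline itself: generic seq2seq
# translation of English hypotheses with Hugging Face transformers.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "some-org/en-to-indic-model"  # placeholder, not a real checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def translate_batch(hypotheses, max_length=128):
    """Translate a list of English hypotheses; returns decoded target strings."""
    inputs = tokenizer(hypotheses, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, max_length=max_length, num_beams=4)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Example usage on a single INFOTABS-style hypothesis.
translated = translate_batch(["The table lists the city's population in 2011."])
```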
“…Training on fairseq enables efficient batching, mixed-precision training, and multi-GPU and multi-machine training. IndicTrans (Ramesh et al., 2021), a Transformer-4x multilingual NMT model by AI4Bharat, is trained on the Samanantar dataset. The architecture of our approach is displayed in Fig.…”
Section: IndicTrans
Mentioning confidence: 99%
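For context on the quote above: fairseq exposes the mentioned training features through fairseq-train flags (e.g. --fp16 for mixed precision, --update-freq for gradient accumulation), and a Transformer checkpoint trained that way can afterwards be loaded through fairseq's Python hub interface. The directory, checkpoint file name, and omitted preprocessing flags below are placeholders; the actual IndicTrans release wraps generation in its own inference scripts.

```python
# Hedged sketch: loading a fairseq Transformer checkpoint for generation.
# All paths are placeholders; IndicTrans additionally needs SentencePiece
# preprocessing and language-tag handling that its own repository provides.
from fairseq.models.transformer import TransformerModel

model = TransformerModel.from_pretrained(
    "checkpoints/",                 # placeholder: directory holding the trained model
    checkpoint_file="model.pt",     # placeholder: checkpoint file name
    data_name_or_path="data-bin/",  # placeholder: binarized data dir with the dictionaries
)
model.eval()  # inference mode
print(model.translate("This is a test sentence.", beam=5))
```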