2022
DOI: 10.48550/arxiv.2205.02022
Preprint
A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

Abstract: Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out of these datasets. This is primarily because many widely spoken languages are not well represented on the web and are therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optim…
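The approach the title points to, adapting an existing pre-trained multilingual translation model with only a few thousand parallel sentences, can be sketched roughly as follows. This is a minimal illustration assuming a HuggingFace Transformers setup with the M2M-100 checkpoint and placeholder English-Yoruba data; the paper's exact models, languages, and training recipe are not reproduced here.

```python
# Minimal fine-tuning sketch, not the paper's exact setup: adapt a pre-trained
# multilingual MT model on a small parallel corpus. Model name, language pair,
# and data below are illustrative assumptions.
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"  # assumed checkpoint for illustration
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Hypothetical placeholder pairs; in practice these would be a few thousand
# professionally translated news sentences.
pairs = [
    ("English news sentence 1.", "Yoruba translation 1."),
    ("English news sentence 2.", "Yoruba translation 2."),
]

tokenizer.src_lang, tokenizer.tgt_lang = "en", "yo"  # Yoruba is one of M2M-100's languages
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):                                # a few passes over the tiny corpus
    for src, tgt in pairs:
        batch = tokenizer(src, text_target=tgt, return_tensors="pt")
        loss = model(**batch).loss                    # cross-entropy over target tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

In this sketch the entire model is updated; with so little data, practitioners often also consider freezing parts of the network or early stopping to limit overfitting.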

Cited by 2 publications (3 citation statements)
References 22 publications
“…Masakhane has created and publicly released several MT datasets and baseline models. The most notable is MAFAND-MT, a news-domain parallel corpus for 16 African languages [26]. Other efforts under Masakhane have been translating Edoid languages, MT from Fon to French [27], MT for Nigerian Pidgin [28], and many others.…”
Section: Text Datasets
confidence: 99%
“…Linguists from the language-speaking communities, led by the language coordinator, provided quality control through discussion and fixes of problematic translations. They also performed multiple checks to find and correct misspellings in the dataset, an approach similar to that taken in other translation projects [26]. Each language had a lead responsible for ensuring the quality and approval of the final translations.…”
Section: Quality Assurance
confidence: 99%
“…Its encoding layer employs a self-attention mechanism and significantly improved performance compared to RNN-based methods. Subsequently, an increasing number of NLP tasks have used methods based on pre-trained models, including named entity recognition [9][10][11], machine translation [12,13], and machine reading comprehension [14][15][16][17][18].…”
Section: Introduction
confidence: 99%
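For context on the self-attention mechanism the last statement refers to, here is a minimal sketch of scaled dot-product self-attention as used in Transformer encoder layers. The projection weights and dimensions are illustrative assumptions; real encoders add multiple heads, residual connections, and layer normalization.

```python
# Illustrative scaled dot-product self-attention over one sequence.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); each token attends over every token in the sequence.
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # query, key, value projections
    scores = q @ k.T / (k.shape[-1] ** 0.5)    # pairwise similarities scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)        # attention distribution per token
    return weights @ v                         # context-aware token representations

seq_len, d_model = 5, 16                       # toy sizes for illustration
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)         # shape: (5, 16)
```

Unlike an RNN, which processes tokens sequentially, every token here is related to every other token in a single parallel step, which is what the quoted statement credits for the performance improvement.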