A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

Adelani, David Ifeoluwa; Alabi, Jesujoba Oluwadara; Fan, Angela; Kreutzer, Julia; Shen, Xiaoyu; Reid, Machel; Ruiter, Dana; Klakow, Dietrich; Nabende, Peter; Chang, Ernie; Tajuddeen, Gwadabe; Sackey, Freshia; Dossou, Bonaventure F. P.; Emezue, Chris Chinenye; Leong, Colin; Beukman, Michael; Muhammad, Shamsuddeen Hassan; Jarso, Guyo Dub; Oreen, Yousuf; Rubungo, Andre Niyongabo; Gilles, Hacheme; Wairagala, Eric Peter; Umair, Nasir, Muhammad; Ajibade, Benjamin Ayoade; Oluwaseyi, Ajayi, Tunde; Gitau, Yvonne Wambui; Abbott, Jade; Ahmed, Mohamed; Millicent, Ochieng; Aremu, Anuoluwapo; Perez, Ogayo; Mukiibi, Jonathan; Ouoba, Kabore, Fatoumata; Kalipe, Godson Koffi; Mbaye, Derguene; Tapo, Allahsera Auguste; Memdjokam, Koagne, Victoire; Edwin, Munkoh-Buabeng; Wagner, Valencia; Abdulmumin, Idris; Awokoya, Ayodele; Buzaaba, Happy; Andiswa, Bukula; Manthalu, Sam

doi:10.48550/arxiv.2205.02022

Cited by 2 publications

(3 citation statements)

References 22 publications


“…Masakhane has created and publicly released several MT datasets and baseline models. The most notable is MAFAND-MT, a news domain parallel corpus for 16 African languages [26]. Other efforts under Masakhane have been translating Edoid languages, MT from Fon to French [27], MT for Nigerian Pidgin [28], and many others.…”

Section: Text Datasets (mentioning)

confidence: 99%

“…Linguists from the language-speaking communities led by the language coordinator provided quality control through discussion and fixtures of problematic translations. They also performed multiple checks to find and correct the misspellings in the dataset, which is a similar approach in other translation projects [26]. Each language had a lead responsible for ensuring the quality and approval of the final translations.…”

Section: Quality Assurance (mentioning)

confidence: 99%


Building Text and Speech Benchmark Datasets and Models for Low‐Resourced East African Languages: Experiences and Lessons

Nakatumba‐Nabende, Babirye, Nabende et al. 2024

Applied AI Letters


Africa has over 2000 languages; however, those languages are not well represented in the existing natural language processing ecosystem. African languages lack essential digital resources to effectively engage in advancing language technologies. There is a need to generate high‐quality natural language processing resources for low‐resourced African languages. Obtaining high‐quality speech and text data is expensive and tedious because it can involve manual sourcing and verification of data sources. This paper discusses the process taken to curate and annotate text and speech datasets for five East African languages: Luganda, Runyankore‐Rukiga, Acholi, Lumasaba, and Swahili. We also present results obtained from baseline models for machine translation, topic modeling and classification, sentiment classification, and automatic speech recognition tasks. Finally, we discuss the experiences, challenges, and lessons learned in creating the text and speech datasets.


“…Its encoding layer employs a self-attention mechanism and significantly improved performance compared to the RNN method. Subsequently, an increasing number of NLP tasks use methods based on pre-trained models, including named entity recognition [9][10][11], machine translation [12,13], and machine reading comprehension [14][15][16][17][18].…”

Section: Introduction (mentioning)

confidence: 99%

DaGATN: A Type of Machine Reading Comprehension Based on Discourse-Apperceptive Graph Attention Networks

Wu, Sun, Wang et al. 2023

Applied Sciences


In recent years, with the advancement of natural language processing techniques and the release of models like ChatGPT, how language models understand questions has become a hot topic. In handling complex logical reasoning with pre-trained models, its performance still has room for improvement. Inspired by DAGN, we propose an improved DaGATN (Discourse-apperceptive Graph Attention Networks) model. By constructing a discourse information graph to learn logical clues in the text, we decompose the context, question, and answer into elementary discourse units (EDUs) and connect them with discourse relations to construct a relation graph. The text features are learned through a discourse graph attention network and applied to downstream multiple-choice tasks. Our method was evaluated on the ReClor dataset and achieved an accuracy of 74.3%, surpassing the best-known performance methods utilizing deberta-xlarge-level pre-trained models, and also performed better than ChatGPT (Zero-Shot).
