Learning morphologically informed embedding spaces with cross-lingual models has become an active area of research and has enabled breakthroughs in applications such as machine translation, named entity recognition, document classification, and natural language inference. However, such methods are not yet customary for Southern African low-resourced languages. In this paper, we present, evaluate, and benchmark a cohort of cross-lingual embeddings for English and Southern African languages on two classification tasks: News Headlines Classification (NHC) and Named Entity Recognition (NER). Our methodology considers four agglutinative languages from the eleven official South African languages: isiXhosa, Sepedi, Sesotho, and Setswana. Canonical correlation analysis (CCA) and VecMap are the two cross-lingual alignment strategies adopted for this study. The monolingual embeddings used in this work are GloVe (source) and fastText (source and target) embeddings. Our results indicate that, with enough comparable corpora, we can develop strong joint representations between English and the considered Southern African languages. More specifically, the best zero-shot transfer results on the available Setswana NHC dataset were achieved using canonically correlated embeddings with a multi-layer perceptron as the training model (54.5% accuracy). Furthermore, our best NER performance was achieved using canonically correlated cross-lingual embeddings with Conditional Random Fields as the training model (96.4% F1 score). Collectively, this study’s results were competitive with the benchmarks of the explored NHC and NER datasets on both zero-shot NHC and NER tasks, with our advantage being the use of very minimal resources.
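The CCA-based alignment mentioned above can be sketched in a few lines of linear algebra. This is a minimal, hypothetical illustration (not the authors' implementation): given embeddings for a seed dictionary of English–target translation pairs, CCA finds projections of both monolingual spaces into a shared space where the paired vectors are maximally correlated. The function name `cca_align` and the synthetic data are assumptions for demonstration only.

```python
import numpy as np

def cca_align(X, Y, k):
    """Project two embedding spaces into a shared k-dim space via CCA.

    X: (n, d1) source-language vectors for n translation pairs.
    Y: (n, d2) target-language vectors for the same n pairs.
    Returns projection matrices Wx (d1, k) and Wy (d2, k); apply them
    to mean-centered embeddings to obtain the shared representations.
    """
    # Center each space.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Whiten each space with a thin SVD: Xc = Ux @ diag(Sx) @ Vxt.
    Ux, Sx, Vxt = np.linalg.svd(Xc, full_matrices=False)
    Uy, Sy, Vyt = np.linalg.svd(Yc, full_matrices=False)
    # SVD of the cross-correlation of the whitened spaces yields the
    # canonical directions; singular values are the canonical correlations.
    U, S, Vt = np.linalg.svd(Ux.T @ Uy)
    Wx = Vxt.T @ np.diag(1.0 / Sx) @ U[:, :k]
    Wy = Vyt.T @ np.diag(1.0 / Sy) @ Vt.T[:, :k]
    return Wx, Wy
```

In a zero-shot setup like the one described, a classifier would then be trained on English vectors projected through `Wx` and applied directly to target-language vectors projected through `Wy`.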
Low-resource languages pose a particularly difficult challenge to neural machine translation (NMT), and there appear to be too few machine translation (MT) systems to support African language accessibility. Masakhane Web, an NMT system for African languages, is proposed in this paper. Our approach is an open-source platform that is free, flexible, and produces reasonably accurate translations for African languages. The platform makes use of MT models trained by the Masakhane community. It enables users to generate new data by providing feedback on translations, which is then used to retrain and improve the models. Ultimately, our goal is to create a platform that can provide accurate translations for African languages and make the process of creating MT models easier for those who lack technical expertise. Furthermore, we include strategies for domain experts to evaluate the system and explain how the platform can be used as a data collection source to improve MT for African languages.