Learning morphologically supplemented embedding spaces using
cross-lingual models has become an active area of research and
facilitated many research breakthroughs in various applications such as
machine translation, named entity recognition, document classification,
and natural language inference. However, such techniques have not yet
become commonplace for low-resourced Southern African languages. In this
paper, we present, evaluate, and benchmark a suite of cross-lingual
embeddings for English–Southern African language pairs on two
classification tasks: News Headlines Classification (NHC) and Named
Entity Recognition (NER).
Our methodology considers four agglutinative languages from the eleven
official South African languages: isiXhosa, Sepedi, Sesotho, and
Setswana. Canonical correlation analysis (CCA) and VecMap are the two
cross-lingual alignment strategies adopted in this study. The
monolingual embeddings used in this work are GloVe (source) and fastText
(source and target) embeddings. Our results indicate that, given enough
comparable corpora, strong jointly aligned representations can be
developed between English and the considered Southern African languages. More
specifically, the best zero-shot transfer results on the available
Setswana NHC dataset were achieved using canonically correlated
embeddings with a multi-layer perceptron (MLP) as the training model (54.5%
accuracy). Furthermore, our NER best performance was achieved using
canonically correlated cross-lingual embeddings with Conditional Random
Fields (CRF) as the training model (96.4% F1 score). Collectively, this
study’s results are competitive with the benchmarks of the explored NHC
and NER datasets on both the zero-shot NHC and NER tasks, our advantage
being the use of very minimal resources.
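The abstract does not spell out how CCA-based alignment works; as a rough sketch (function name, regularization, and data shapes are our own assumptions, not details from the paper), CCA learns linear maps that project two monolingual embedding spaces into a shared space where translation pairs are maximally correlated:

```python
import numpy as np

def cca_align(X, Y, k, reg=1e-8):
    """Sketch of CCA-based cross-lingual embedding alignment.

    X, Y: (n, d_x) and (n, d_y) matrices whose rows are embeddings of
    translation pairs (row i of X is a source word, row i of Y its
    target-language translation, e.g. English-Setswana).
    Returns (A, B, s): linear maps so that X_centered @ A and
    Y_centered @ B live in a shared k-dimensional space, plus the
    canonical correlations s of the k shared axes.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Covariance blocks, with a small ridge term for numerical stability
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Lx = np.linalg.cholesky(Cxx)  # Cxx = Lx @ Lx.T
    Ly = np.linalg.cholesky(Cyy)
    # Whitened cross-covariance: inv(Lx) @ Cxy @ inv(Ly).T
    M = np.linalg.solve(Lx, np.linalg.solve(Ly, Cxy.T).T)
    U, s, Vt = np.linalg.svd(M)
    # Un-whiten the top-k singular directions to get the projections
    A = np.linalg.solve(Lx.T, U[:, :k])
    B = np.linalg.solve(Ly.T, Vt.T[:, :k])
    return A, B, s[:k]
```

Once A and B are learned from a seed dictionary of translation pairs, the full source and target vocabularies can be projected into the shared space, enabling the zero-shot transfer described above: a classifier trained on projected English embeddings can be applied directly to projected Setswana embeddings.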