2020
DOI: 10.48550/arxiv.2003.04986
Preprint

Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi

Cited by 6 publications (19 citation statements)
References 0 publications
“…A trivial attempt to address this imbalance is to simply redirect research attention to the so-called resourced-deprived languages. However, such initiatives were less successful due to the high cost of creating, curating, and annotating quality datasets for low-resourced languages 11. Instead, a research focus on using the morpho-syntactic embedding spaces of these available high-resourced languages to supplement language-generic knowledge encoded in embedding spaces of low-resourced languages became an active research alternative: a field of extracting cross-language (CL) embeddings.…”
Section: Introduction
confidence: 99%
“…This means that the majority of the languages unable to meet these requirements are still not alleviated from the challenge of under-representation 45. South African languages reserving a seat amongst the cohort unable to meet the aforementioned prerequisites 10,11,46. For this, we aim to explore News Headlines Classification (NHC) 11 and Named Entity Recognition (NER) 47 downstream tasks on four agglutinative of the 11 official South African languages: Isixhosa (languages of the Nguni tribe), Sesotho, Setswana, and Sepedi (three languages of the Sotho-Tswana language family).…”
Section: Introduction
confidence: 99%
“…The development, training, and evaluation of word embedding models must therefore be context-specific. Examples of word embeddings linked to a certain domain are: the NukeBERT model (Jain et al, 2020) that is trained on texts from the nuclear and atomic energy section; specialised embeddings for finance (Theil et al, 2020); and embeddings trained on certain languages, such as Setswana and Sepedi (Marivate et al, 2020) or Croatian (Svoboda & Beliga, 2017).…”
Section: Introduction
confidence: 99%
“…The word embedding we generated is publicly available via a github repository. 1 It is, to the best of our knowledge, the first publicly available word embedding trained on South Africa news article data, and thus forms a valuable addition to the field of NLP in African contexts (Marivate et al, 2020). The embedding will allow researchers to investigate the meanings of numerous words from within a South African context and to seek answers to culturally or politically oriented South African research questions-such as, to give but one small example, how the African National Congress (ANC) and Democratic Alliance (DA) relate to terms such as "corruption" and "white monopoly capital".…”
Section: Introduction
confidence: 99%