2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics (PRASA-RobMech) 2017
DOI: 10.1109/robomech.2017.8261150

Improved text language identification for the South African languages

Abstract: The paper presents a hierarchical naive Bayesian and lexicon-based classifier for short-text language identification (LID), useful for under-resourced languages. The algorithm is evaluated on short pieces of text for the 11 official South African languages, some of which are similar to one another. The algorithm is compared to recent approaches using test sets from previous works on South African languages as well as the datasets of the Discriminating between Similar Languages (DSL) shared tasks. Remaining research opportu…

Cited by 9 publications (3 citation statements)
References 22 publications
“…In the context of NLP, Almutir and Nadeem [4] use confusion matrices to evaluate the performance of named-entity recognition systems by analyzing the discrepancies between predicted and actual entity labels. Pienaar and Snyman use them for the identification of eleven official South African languages [5]. Abandah et al [6] use confusion matrix to correct spelling mistakes in Arabic with insufficient datasets to train the correction models.…”
Section: Confusion Matrix (mentioning)
confidence: 99%
“…South African languages. This dataset was used by Duvenhage et al (2017) and consists of short texts for 11 South African languages, many of which are related; these are the Nguni languages (zul, xho, nbl, ssw), Afrikaans (afr) and English (en), the Sotho languages (nso, sot, tsn), tshiVenda (ven) and Xitsonga (tso). The texts are on average 15-20 characters long.…”
Section: Data (mentioning)
confidence: 99%
“…South African native languages do not have enough resources to be used to built such contextual Chatbots and Virtual Text Assistant. Therefore, the resources for native languages need to be created so that they can be used to build software agents that understand South African native languages (Duvenhage et al 2017).…”
Section: Introduction (mentioning)
confidence: 99%