2018
DOI: 10.3390/a11040039

On Hierarchical Text Language-Identification Algorithms

Abstract: Text on the Internet is written in different languages and scripts that can be divided into different language groups. Most of the errors in language identification occur with similar languages. To improve the performance of short-text language identification, we propose four different levels of hierarchical language identification methods and conduct comparative tests in this paper. The efficiency of the algorithms was evaluated on sentences from 97 languages, and its macro-averaged F1-score reache…

Cited by 3 publications
(3 citation statements)
references
References 23 publications
“…Hanif et al (2007) used the Unicode database to get the possible languages of individual Unicode characters. Lately, the knowledge of relevant alphabets has been used for LI also by Hasimu and Silamu (2017) and Samih (2017).…”
Section: Characters (mentioning)
confidence: 99%
“…Their approach using words weighted with TF-IDF (product of term frequency and inverse document frequency) and SVMs reached 97.7% accuracy on the tweets when using only the provided tweet training corpora. Hasimu and Silamu (2018) included the same three languages in their test setting. They used a two-stage language identification system, where the languages were first identified as a group using Unicode code ranges.…”
Section: Language Identification For Devanagari Script (mentioning)
confidence: 99%
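
The two-stage idea described in the statement above (group first by Unicode code range, then identify the language within the group) can be illustrated with a minimal sketch. This is not the cited authors' implementation: the range table, the marker-word lists, and the function names `script_group` and `identify` are hypothetical, chosen only to show the control flow. A real second stage would use a trained classifier rather than a handful of function words.

```python
from collections import Counter

# Stage 1 table: illustrative Unicode code point ranges per script group.
# These ranges are assumptions for the sketch, not taken from the cited paper.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Arabic": (0x0600, 0x06FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Latin": (0x0041, 0x024F),
}

# Stage 2 table: toy marker words per language within a group (hypothetical).
MARKER_WORDS = {
    "Devanagari": {
        "Hindi": {"है", "और"},
        "Marathi": {"आहे", "आणि"},
        "Nepali": {"छ", "र"},
    }
}


def script_group(text):
    """Return the script group covering most letters in text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():  # skip digits, punctuation, combining marks
            continue
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts.most_common(1)[0][0] if counts else None


def identify(text):
    """Two-stage identification: script group first, then language in group."""
    group = script_group(text)
    candidates = MARKER_WORDS.get(group)
    if not candidates:
        return group, None  # no second-stage model for this group
    tokens = text.split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in candidates.items()
    }
    best = max(scores, key=scores.get)
    return group, (best if scores[best] > 0 else None)
```

Calling `identify` on a sentence returns a `(group, language)` pair; the language is `None` when the group has no second-stage table or no marker word matches.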
“…Their tests resulted in an F1-score of 0.993 within the group of languages using Devanagari with 700 best distinguishing bigrams. Indhuja et al (2014) provided test results for several different combinations of the five languages and for the set of languages used by Hasimu and Silamu (2018), they reached 96% accuracy with word unigrams. Rani, Ojha, and Jha (2018) described a language identification system which they used for discriminating between Hindi and Magahi.…”
Section: Language Identification For Devanagari Script (mentioning)
confidence: 99%