2009
DOI: 10.4038/icter.v2i2.1385

Optimizing n‑gram Order of an n‑gram Based Language Identification Algorithm for 68 Written Languages

Abstract: Language identification technology is widely used in the domains of machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages. However, the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n-gram based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how n-gram order…

citations
Cited by 7 publications
(3 citation statements)
references
References 4 publications
“…Also, previous research has shown that performances of similar experiments using different values of n usually peak at 3, i.e. trigrams (Choong, Mikami, Marasinghe, & Nandasara, ; Suzuki, Mikami, Ohsato, & Chubachi, ; Jhamtani, Bhogi, & Raychoudhury, ; Chittaranjan, Vyas, Bali, & Choudhury, ).…”
Section: Methods (mentioning)
confidence: 99%
“…All the n-gram-based approaches differ primarily in the calculation of the distance between the n-gram profile of the training and test text (Selamat, 2011; Yang and Liang, 2010), or by using additional features on top of the n-gram profiles (Padma et al, 2009; Carter et al, 2013). One of the fairly robust definitions of the distance (or similarity) was proposed by Choong et al (2009) who simply check the proportion of n-gram types seen in the tested document of the most frequent n-gram types extracted from training documents for each language. The highest-scoring language is then returned.…”
Section: Related Work (mentioning)
confidence: 99%
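
The scoring rule described in the statement above can be illustrated with a minimal Python sketch. It is one reading of that one-sentence description, not the published implementation: the function names, the `top_k` cut-off, the use of character trigrams, and the toy training strings are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(training_text, n=3, top_k=100):
    """Set of the `top_k` most frequent n-gram types in the training text.

    `top_k` is an illustrative value, not one taken from the paper.
    """
    counts = Counter(char_ngrams(training_text, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def score(document, profile, n=3):
    """Proportion of the profile's n-gram types that occur in the document."""
    if not profile:
        return 0.0
    seen = set(char_ngrams(document, n))
    return len(profile & seen) / len(profile)


def identify(document, profiles, n=3):
    """Return the language whose profile scores highest on the document."""
    return max(profiles, key=lambda lang: score(document, profiles[lang], n))


# Toy training data (hypothetical; real profiles need far larger corpora).
training = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "de": "der schnelle braune fuchs springt und der hund liegt dann still im garten",
}
profiles = {lang: build_profile(text) for lang, text in training.items()}

print(identify("the dog jumps over the fox", profiles))
```

Because the score only tests whether each profile n-gram type appears in the document, it ignores how often the n-gram occurs there; changing the `n` argument is how one would compare different n-gram orders under this sketch.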
“…It is a process that attempts to classify text in a language into a pre-defined set of known languages [1]. LI is the first step in text mining, information retrieval, speech processing and machine translation [2][3][4].…”
Section: Introduction (mentioning)
confidence: 99%