2011
DOI: 10.1007/978-3-642-24016-4_1

Improved N-grams Approach for Web Page Language Identification

Cited by 9 publications
(2 citation statements)
References 21 publications
“…Cavnar et al (1994) were probably the first; they used the 300 most frequent character n-grams (with n ranging from 1 to 5, as is also typically used in other works). All the n-gram-based approaches differ primarily in the calculation of the distance between the n-gram profile of the training and test text (Selamat, 2011; Yang and Liang, 2010), or by using additional features on top of the n-gram profiles (Padma et al, 2009; Carter et al, 2013). One of the fairly robust definitions of the distance (or similarity) was proposed by Choong et al (2009) who simply check the proportion of n-gram types seen in the tested document of the most frequent n-gram types extracted from training documents for each language.…”
Section: Related Work (mentioning)
confidence: 99%
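
The profile-and-compare idea described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the implementation from any of the cited papers: the function names (`char_ngrams`, `build_profile`, `coverage_score`, `identify`), the lowercasing step, and the tie-breaking rule are assumptions made for the example. The profile size of 300 and the n-gram range 1 to 5 come from the quoted description of Cavnar et al, and the score is one plausible reading of the "proportion of n-gram types seen in the tested document" attributed to Choong et al.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Yield every character n-gram of length n_min..n_max in text."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def build_profile(training_text, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first.

    Lowercasing and alphabetical tie-breaking are assumptions of this sketch.
    """
    counts = Counter(char_ngrams(training_text.lower()))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def coverage_score(doc_text, lang_profile):
    """Proportion of the profile's n-gram types that occur in the document."""
    if not lang_profile:
        return 0.0
    seen = set(char_ngrams(doc_text.lower()))
    hits = sum(1 for gram in lang_profile if gram in seen)
    return hits / len(lang_profile)


def identify(doc_text, profiles):
    """Pick the language whose profile is best covered by the document."""
    return max(profiles, key=lambda lang: coverage_score(doc_text, profiles[lang]))


if __name__ == "__main__":
    # Toy training data, far too small for real use; illustration only.
    profiles = {
        "en": build_profile("the quick brown fox jumps over the lazy dog and the cat"),
        "de": build_profile("der schnelle braune fuchs springt ueber den faulen hund"),
    }
    print(identify("the dog and the fox", profiles))
```

A rank-based distance between two profiles, as in the Cavnar et al line of work, would replace `coverage_score` while leaving `build_profile` unchanged. That matches the statement's point that the approaches differ mainly in how the profiles are compared.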
“…Automatic language identification of text documents was addressed early on as an information processing task (Beesley 1988; Singh 2006). Further approaches to analysing web page language is also available (Selamat 2011). How to assess quality on the level of punctuation and misspelling and approaches to developing automated grading systems are discussed in Agichtein et al (2008).…”
Section: Conclusion and Taking It Forward (mentioning)
confidence: 99%