2002
DOI: 10.1145/772755.772759

A language and character set determination method based on N-gram statistics

Abstract: An N-gram-based language, script, and encoding scheme-detection method is introduced in this article. The method detects language, script, and encoding schemes using a target text document encoded by computer by checking how many byte sequences of the target match the byte sequences that can appear in the texts belonging to a language, script, and encoding scheme. This detection mechanism is different from conventional N-gram-based methods in that its threshold for any category is uniquely predetermined. The m…
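
The matching idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the trigram size `n=3`, the single shared `threshold=0.9`, and the toy training samples in the usage note are all assumptions made for illustration. The abstract only states that byte sequences of the target are checked against those that can appear in each category and that each category's threshold is predetermined.

```python
from typing import Dict, Iterable, List, Optional, Set


def byte_ngrams(data: bytes, n: int = 3) -> List[bytes]:
    """Return every length-n byte window of data, in order."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_model(samples: Iterable[bytes], n: int = 3) -> Set[bytes]:
    """Collect the byte n-grams seen in sample texts of one category."""
    known: Set[bytes] = set()
    for sample in samples:
        known.update(byte_ngrams(sample, n))
    return known


def match_rate(target: bytes, known: Set[bytes], n: int = 3) -> float:
    """Fraction of the target's byte n-grams that appear in the category model."""
    grams = byte_ngrams(target, n)
    if not grams:
        return 0.0
    return sum(g in known for g in grams) / len(grams)


def detect(
    target: bytes,
    models: Dict[str, Set[bytes]],
    threshold: float = 0.9,  # assumed value; the paper's thresholds are not given here
    n: int = 3,
) -> Optional[str]:
    """Return the best-matching category label, or None if no rate reaches the threshold."""
    rates = {label: match_rate(target, known, n) for label, known in models.items()}
    if not rates:
        return None
    best = max(rates, key=rates.get)
    return best if rates[best] >= threshold else None
```

As a usage example, building one model from `b"the quick brown fox jumps over the lazy dog"` and another from a UTF-8 encoded Japanese sentence, then calling `detect(b"the lazy dog", models)`, returns the label of the first model, while a byte string that shares no trigrams with either model returns `None`. Working on raw bytes rather than decoded characters is what lets a single model capture language, script, and encoding together.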

Citations: Cited by 28 publications (21 citation statements)
References: References 1 publication
“…The authors introduced a binary indexed linear classifier [1] for language identification in the Language Observatory Project. The term weighting schemes of this classifier are configured as follows:…”
Section: Review Of Related Research (mentioning)
confidence: 99%
“…One information retrieval technique, a method that finds documents that are "similar" to a given snippet of text will make searching more efficient in many ways. The authors have introduced binary similarity [1]: a linear classifier in which training documents are indexed in a binary manner. In this indexing system, the weight of a term is set to 1 if it occurs in the training documents and 0 if it does not, as defined in Eq.…”
Section: Introduction (mentioning)
confidence: 99%
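
The binary weighting rule stated in the citation above can be written out as follows. The symbols $w_t$ (weight) and $t$ (term) are notation assumed here for illustration; the citing paper's own equation is truncated in the excerpt and may use different symbols or add further terms.

```latex
w_t =
\begin{cases}
1 & \text{if term } t \text{ occurs in the training documents,} \\
0 & \text{otherwise.}
\end{cases}
```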
“…The huge data are stored in clusters of 40 servers at Nagaoka University of Technology under the management of the Language Observatory (LO) project (Mikami et al 2005). The system based on N-gram technology (Suzuki et al 2002) has been extensively used in the past four years in a project for LSE identification of huge domains.…”
Section: Snapshots Of Evolutionized Malay (Em) Occurrences In Web For (mentioning)
confidence: 99%