2000
DOI: 10.1162/089120100561746

A Compression-based Algorithm for Chinese Word Segmentation

Abstract: Chinese is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard …

Cited by 107 publications (70 citation statements: 2 supporting, 68 mentioning, 0 contrasting). References 14 publications.
“…In this case, only 108 discovered words were fragments, and only 842 (fewer than 5%) true words were missed (most of which are words such as "moby" and "dick" that tend to be recognized as a single compound word, "mobydick"). The sensitivity, adjusted sensitivity, and specificity of word segmentation increased to 76%, 95%, and 99%, respectively, which is comparable to the current best supervised methods (8–13). More details can be found in SI Appendix, Table S1, Fig.…”
Section: Results (supporting)
confidence: 54%
“…Many available methods for processing Chinese texts focus on word segmentation and often assume that either a comprehensive dictionary or a large training corpus (usually news-article texts that have been manually segmented and labeled) is available. These methods fall into three categories: (i) methods based on word matching (3); (ii) methods based on grammatical rules (4–6); and (iii) methods based on statistical models, e.g., the hidden Markov model (7) and its extensions (8), the maximum-entropy Markov model (9), conditional random fields (10–12), and information compression (13). These methods, especially the ones based on statistical models, work quite well when the given dictionary and training corpus are sufficient.…”
(mentioning)
confidence: 99%
“…The tagging problem resembles the word segmentation problem in some natural languages where no clear separations exist between different words [15]. In the word segmentation problem, the task is to find correct separations between sequences of characters to form words.…”
Section: Code Segmentation (mentioning)
confidence: 99%
“…It produces state-of-the-art text compression results for many languages, as detailed in the reports mentioned in [31], [36], [57]. PPM has been used as the basis for an effective method of Chinese word segmentation, in which spaces are inserted as word separators into Chinese text, which otherwise has none [33]. Other studies, such as [31], [34]–[36], [57], [58], have reported using PPM in different languages for other NLP tasks such as cryptology, code switching, authorship attribution, text correction, and speech recognition.…”
Section: PPM-based Compression for Natural Language Text (mentioning)
confidence: 99%
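The space-insertion idea referenced above can be illustrated with a small sketch: search over possible word boundaries for the segmentation whose encoded form is shortest under a character-level model. This is a minimal illustration, not the implementation from [33]; the `uniform_model` placeholder, the order-2 context, the 8-character word cap, and the assumption that the context resets after each space are all simplifications of what an adaptive PPM coder would do.

```python
import math

def uniform_model(ch, context, alphabet_size=5000):
    """Placeholder character model p(ch | context). A real system would
    substitute adaptive PPM probabilities with escape-based back-off."""
    return 1.0 / alphabet_size

def code_length(model, text):
    """Code length of `text` in bits under the character model,
    using an order-2 context for illustration."""
    bits, context = 0.0, ""
    for ch in text:
        bits += -math.log2(model(ch, context))
        context = (context + ch)[-2:]
    return bits

def segment(model, text, max_word=8):
    """Viterbi-style search: insert spaces so that the spaced text has
    minimal total code length, assuming the context resets at each space.
    best[i] holds (bits, words) for the best split of text[:i]."""
    best = [(0.0, [])] + [(math.inf, [])] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_word), i):
            # Cost of encoding text[j:i] as one word followed by a space.
            bits = best[j][0] + code_length(model, text[j:i] + " ")
            if bits < best[i][0]:
                best[i] = (bits, best[j][1] + [text[j:i]])
    return " ".join(best[-1][1])
```

With a trained adaptive model in place of `uniform_model`, frequently seen character sequences become cheap to encode as words, so the minimum-code-length split tends to fall on word boundaries.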
“…This example uses a specific variant of the PPM prediction method, PPMD, to model the string أبجدبهىبأأبجد. As stated, a maximum model order of 5 has been shown to be effective, but a maximum order of 2 is used in this example for illustration purposes. In the table, c shows the count, p expresses the probability, and |A| represents the size of the alphabet used [33]. For this example, let the next character be the letter ب. This character has been seen once before ("جد" → "ب") in the order-two context "جد", and consequently it has a probability of ½ (utilising equation (1), as the count is 1).…”
Section: PPM-based Compression for Natural Language Text (mentioning)
confidence: 99%
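The ½ in the excerpt is the PPMD estimate: a symbol seen c times in a context with total count n and d distinct successors gets probability (2c − 1)/(2n), and the escape probability is d/(2n). The toy reconstruction below (not the cited code) uses the Latin transliteration "abjdbhybaabjd" in place of the Arabic string:

```python
from collections import Counter, defaultdict

def ppmd_probability(counts, symbol):
    """PPMD estimator for a single context:
    p(s) = (2*c(s) - 1) / (2n), p(escape) = d / (2n),
    where n is the total count and d the number of distinct symbols seen."""
    n = sum(counts.values())
    d = len(counts)
    if symbol in counts:
        return (2 * counts[symbol] - 1) / (2 * n)
    return d / (2 * n)  # escape: back off to a shorter context

# Build order-2 context counts from a transliteration of the example string.
text = "abjdbhybaabjd"
model = defaultdict(Counter)
for i in range(2, len(text)):
    model[text[i - 2:i]][text[i]] += 1

# The context "jd" has been followed by "b" exactly once, so
# p("b" | "jd") = (2*1 - 1) / (2*1) = 1/2, matching the excerpt.
print(ppmd_probability(model["jd"], "b"))  # 0.5
print(ppmd_probability(model["jd"], "x"))  # 0.5 (escape probability)
```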