2000
DOI: 10.1162/089120100561746

A Compression-based Algorithm for Chinese Word Segmentation

Abstract: Chinese is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard …

Cited by 107 publications (70 citation statements: 2 supporting, 68 mentioning, 0 contrasting). References 14 publications.
“…In this case, only 108 discovered words were fragments, and only 842 (fewer than 5%) true words were missed (most of which are words such as "moby" and "dick" that tend to be recognized as a single compound word, "mobydick"). The sensitivity, adjusted sensitivity, and specificity of word segmentation increased to 76%, 95%, and 99%, respectively, which is comparable to the current best supervised methods (8–13). More details can be found in SI Appendix, Table S1, Fig.…”
Section: Results (supporting)
confidence: 54%
“…Many available methods for processing Chinese texts focus on word segmentation and often assume that either a comprehensive dictionary or a large training corpus (usually news-article texts that have been manually segmented and labeled) is available. These methods fall into three categories: (i) methods based on word matching (3); (ii) methods based on grammatical rules (4–6); and (iii) methods based on statistical models, e.g., the hidden Markov model (7) and its extensions (8), the maximum-entropy Markov model (9), conditional random fields (10–12), and information compression (13). These methods, especially the ones based on statistical models, work quite well when the given dictionary and training corpus are sufficient.…”
(mentioning)
confidence: 99%
“…The tagging problem resembles the word segmentation problem in some natural languages where no clear separations exist between different words [15]. In the word segmentation problem, the task is to find correct separations between sequences of characters to form words.…”
Section: Code Segmentation (mentioning)
confidence: 99%
“…It produces state-of-the-art text compression results for many languages, as detailed in the reports mentioned in [31], [36], [57]. PPM has been used as the basis for an effective method of Chinese word segmentation, in which spaces are inserted as word separators into Chinese text, which otherwise has none [33]. Other studies, such as [31], [34]–[36], [57], [58], have reported using PPM in different languages for other NLP tasks such as cryptology, code switching, authorship attribution, text correction, and speech recognition.…”
Section: PPM-based Compression for Natural Language Text (mentioning)
confidence: 99%
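The space-insertion idea referenced above can be illustrated with a small sketch: search over possible word boundaries for the segmentation whose encoded form is shortest under a character-level model. This is a minimal illustration, not the implementation from [33]; the `uniform_model` placeholder, the order-2 context, the 8-character word cap, and the assumption that the context resets after each space are all simplifications of what an adaptive PPM coder would do.

```python
import math

def uniform_model(ch, context, alphabet_size=5000):
    """Placeholder character model p(ch | context). A real system would
    substitute adaptive PPM probabilities with escape-based back-off."""
    return 1.0 / alphabet_size

def code_length(model, text):
    """Code length of `text` in bits under the character model,
    using an order-2 context for illustration."""
    bits, context = 0.0, ""
    for ch in text:
        bits += -math.log2(model(ch, context))
        context = (context + ch)[-2:]
    return bits

def segment(model, text, max_word=8):
    """Viterbi-style search: insert spaces so that the spaced text has
    minimal total code length, assuming the context resets at each space.
    best[i] holds (bits, words) for the best split of text[:i]."""
    best = [(0.0, [])] + [(math.inf, [])] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_word), i):
            # Cost of encoding text[j:i] as one word followed by a space.
            bits = best[j][0] + code_length(model, text[j:i] + " ")
            if bits < best[i][0]:
                best[i] = (bits, best[j][1] + [text[j:i]])
    return " ".join(best[-1][1])
```

With a trained adaptive model in place of `uniform_model`, frequently seen character sequences become cheap to encode as words, so the minimum-code-length split tends to fall on word boundaries.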
“…This example uses a specific variant of the PPM prediction method, PPMD, to model the string أبجدبهىبأأبجد. As stated, a maximum model order of 5 has been shown to be effective, but a maximum order of 2 is used in this example for illustration purposes. In the table, c shows the count, p expresses the probability, and |A| represents the size of the alphabet used [33]. For this example, let the next character be the letter ب. This character has been seen once before ("جد" → "ب") in the order-two context "جد", and consequently it has a probability of ½ (utilising equation (1), as the count is 1).…”
Section: PPM-based Compression for Natural Language Text (mentioning)
confidence: 99%
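The ½ in the excerpt is the PPMD estimate: a symbol seen c times in a context with total count n and d distinct successors gets probability (2c − 1)/(2n), and the escape probability is d/(2n). The toy reconstruction below (not the cited code) uses the Latin transliteration "abjdbhybaabjd" in place of the Arabic string:

```python
from collections import Counter, defaultdict

def ppmd_probability(counts, symbol):
    """PPMD estimator for a single context:
    p(s) = (2*c(s) - 1) / (2n), p(escape) = d / (2n),
    where n is the total count and d the number of distinct symbols seen."""
    n = sum(counts.values())
    d = len(counts)
    if symbol in counts:
        return (2 * counts[symbol] - 1) / (2 * n)
    return d / (2 * n)  # escape: back off to a shorter context

# Build order-2 context counts from a transliteration of the example string.
text = "abjdbhybaabjd"
model = defaultdict(Counter)
for i in range(2, len(text)):
    model[text[i - 2:i]][text[i]] += 1

# The context "jd" has been followed by "b" exactly once, so
# p("b" | "jd") = (2*1 - 1) / (2*1) = 1/2, matching the excerpt.
print(ppmd_probability(model["jd"], "b"))  # 0.5
print(ppmd_probability(model["jd"], "x"))  # 0.5 (escape probability)
```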