Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics - 1994
DOI: 10.3115/981732.981742

A stochastic finite-state word-segmentation algorithm for Chinese

Abstract: We present a stochastic finite-state model for segmenting Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.

Citations: cited by 136 publications (164 citation statements)
References: 23 publications
“…Such ambiguity in the definition of what constitutes a word makes it difficult to evaluate segmentation algorithms that follow different conventions, as it is nearly impossible to construct a "gold standard" against which to directly compare results [7]. As shown in [23], the rate of agreement between two human judges on this task is less than 80%. The performance of word segmentation is usually measured using precision and recall, where recall is defined as the percent of words in the manually segmented text identified by the segmentation algorithm, and precision is defined as the percentage of words returned by the algorithm that also occurred in the hand-segmented text in the same position.…”
Section: Evaluation and Experimental Results (mentioning)
confidence: 99%
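
The position-sensitive precision and recall described in the statement above can be illustrated with a minimal sketch. It assumes each segmentation is given as a list of words covering the same text, and it represents every word by its character span so that a word only counts as correct when it occurs in the same position. The function names `word_spans` and `precision_recall` and the toy Latin-letter strings are hypothetical, chosen for illustration; they do not come from the cited papers.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character spans they occupy."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def precision_recall(gold_words, predicted_words):
    """Score a predicted segmentation against a hand-segmented one.

    A predicted word is correct only if a gold word covers exactly the
    same character span, i.e. the same word in the same position.
    """
    if "".join(gold_words) != "".join(predicted_words):
        raise ValueError("both segmentations must cover the same text")
    gold = word_spans(gold_words)
    predicted = word_spans(predicted_words)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy example with placeholder characters rather than real Chinese text.
gold = ["AB", "C", "DE"]            # hand segmentation
predicted = ["A", "B", "C", "DE"]   # algorithm output
precision, recall = precision_recall(gold, predicted)
# 2 of the 4 predicted words are correct; 2 of the 3 gold words are found.
print(precision, recall)
```

Comparing spans rather than word strings keeps a word that appears twice in the text from being credited at the wrong position.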
“…Sproat et al, 1996). As an example, consider the Chinese character sequence which forms a complete noun in the sentence…”
Section: Languages Without Word Separation (mentioning)
confidence: 99%
“…E.g., Sproat et al (1996) give a good overview of the problems text analysis for Chinese is confronted with.…”
Section: Language-dependent Syntactic Structure Analysis (mentioning)
confidence: 99%
“…A brief sampling of areas where various automata show up as the underlying formalism include natural language processing (speech recognition, morphological analysis), computational linguistics, robotics and control systems, computational biology (phylogeny, structural pattern recognition), data mining, time series and music (Koskenniemi, 1983;de la Higuera, 2005;Mohri, 1996;Mohri et al, 2002;Mohri, 1997;Mohri et al, 2010;Rambow et al, 2002;Sproat et al, 1996). Thus, developing efficient formal language learning techniques and understanding their limitations is of a broad and direct relevance in the digital realm.…”
Section: Introduction (mentioning)
confidence: 99%