Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.3115/v1/d14-1092

A Joint Model for Unsupervised Chinese Word Segmentation

Abstract: In this paper, we propose a joint model for unsupervised Chinese word segmentation (CWS). Inspired by the "products of experts" idea, our joint model first combines two generative models, a word-based hierarchical Dirichlet process (HDP) model and a character-based hidden Markov model (HMM), by simply multiplying their probabilities together. Gibbs sampling is used for model inference. To further incorporate the strength of goodness-based models, we then integrate nVBE into our joint model by using it to init…
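The combination the abstract describes is straightforward to sketch. Below is a minimal, self-contained illustration of the product-of-experts idea with Gibbs sampling over word boundaries; it is not the authors' implementation. The HDP and HMM experts are replaced by toy stand-ins (a fixed unigram lexicon and a fixed word-boundary probability), and every function name and parameter here is an assumption for illustration only.

```python
import math
import random

def split_by_boundaries(chars, boundaries):
    """Turn a boundary indicator list (boundaries[i] == True means a
    word starts at character position i) into a list of words."""
    words, start = [], 0
    for i in range(1, len(chars)):
        if boundaries[i]:
            words.append("".join(chars[start:i]))
            start = i
    words.append("".join(chars[start:]))
    return words

def word_expert_logprob(words, word_probs, unk=1e-6):
    """Toy stand-in for the word-based HDP expert: a fixed unigram model."""
    return sum(math.log(word_probs.get(w, unk)) for w in words)

def char_expert_logprob(words, p_boundary=0.4):
    """Toy stand-in for the character-based HMM expert: scores the
    begin/inside tag sequence implied by the segmentation."""
    lp = 0.0
    for w in words:
        lp += math.log(p_boundary)                       # word-initial tag
        lp += (len(w) - 1) * math.log(1.0 - p_boundary)  # word-internal tags
    return lp

def joint_logprob(words, word_probs):
    # Product of experts: multiply the two probabilities, i.e. add log-probs.
    return word_expert_logprob(words, word_probs) + char_expert_logprob(words)

def gibbs_sweep(chars, boundaries, word_probs):
    """One Gibbs sweep: resample every candidate boundary between
    adjacent characters under the joint (product) model.
    (Numerical stability of the exp is ignored for brevity.)"""
    for i in range(1, len(chars)):
        logp = []
        for state in (True, False):
            boundaries[i] = state
            logp.append(joint_logprob(split_by_boundaries(chars, boundaries),
                                      word_probs))
        p_true = 1.0 / (1.0 + math.exp(logp[1] - logp[0]))
        boundaries[i] = random.random() < p_true
    return boundaries

# Toy run: with these made-up lexicon entries the sampler typically
# recovers the three two-character words.
chars = list("自然语言处理")
word_probs = {"自然": 0.1, "语言": 0.1, "处理": 0.1}
boundaries = [True] + [random.random() < 0.5 for _ in range(1, len(chars))]
for _ in range(20):
    gibbs_sweep(chars, boundaries, word_probs)
print(split_by_boundaries(chars, boundaries))
```

In the real model, word_probs would be the HDP posterior and char_expert_logprob a trained HMM; multiplying the two probabilities (adding log-probabilities) is exactly the product-of-experts combination the abstract names.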

Cited by 17 publications (18 citation statements). References 13 publications.
“…The only available resource is a very small bilingual lexicon of the 1,000 most common Chinese words and their corresponding English translations. In this setting, we use an unsupervised Chinese word segmentation approach combining a Hierarchical Dirichlet Process (HDP) model with a Bayesian HMM model (Chen et al., 2014) to segment Chinese text instead of the preprocessing steps mentioned in Section 4.1.1. According to Figure 5, our approach still performs well in the low-resource setting, although its accuracy curve is lower than in the rich-resource setting, demonstrating that it works in both rich- and low-resource settings.…”
Section: Results
confidence: 99%
“…We evaluate our models on the SIGHAN 2005 bakeoff datasets (Emerson, 2005), replacing all punctuation marks with ⟨punc⟩, English characters with ⟨eng⟩, and Arabic numbers with ⟨num⟩ (Chen et al., 2014; Wang et al., 2011; Mochihashi et al., 2009; Magistry and Sagot, 2012) for all text, and we only segment the text between punctuation marks. Following Chen et al. (2014), we use both the training data and the test data for training, and only the test data for evaluation. In order to make a fair comparison with previous work, we do not consider using other, larger raw corpora.…”
Section: Experimental Settings and Details
confidence: 99%
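The preprocessing quoted above (placeholder tokens for punctuation, English, and numbers, and segmenting only between punctuation marks) can be sketched as follows. The token spellings <punc>, <eng>, <num> and the exact character classes are assumptions; the cited papers do not fix them in this excerpt.

```python
import re
import unicodedata

def normalize(text: str) -> list[str]:
    """Map punctuation to <punc>, Latin letters to <eng>, and digits to
    <num>, then return the chunks between punctuation marks; only these
    spans are passed to the segmenter."""
    out = []
    for ch in text:
        if unicodedata.category(ch).startswith("P"):
            out.append("<punc>")   # any Unicode punctuation
        elif re.match(r"[A-Za-zＡ-Ｚａ-ｚ]", ch):
            out.append("<eng>")    # halfwidth or fullwidth Latin letter
        elif re.match(r"[0-9０-９]", ch):
            out.append("<num>")    # halfwidth or fullwidth digit
        else:
            out.append(ch)
    joined = "".join(out)
    # collapse runs so that e.g. "2014" becomes a single <num> token
    joined = re.sub(r"(?:<eng>)+", "<eng>", joined)
    joined = re.sub(r"(?:<num>)+", "<num>", joined)
    return [c for c in joined.split("<punc>") if c]
```

For example, normalize("EMNLP2014，分词实验。") yields ["<eng><num>", "分词实验"]: the Latin run and the digit run each collapse to a single placeholder, and the two punctuation marks delimit the spans handed to the segmenter.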
“…Table 4: Accuracies of unsupervised word segmentation. [Only one row survives the table extraction: Precision (All) is 99.9 / 99.9 / 99.6 / 99.9 / 99.0 on Kyoto / BCCWJ / MSR / CITYU / BEST.] BE is the Branching Entropy method of Zhikov et al. (2010), and HMM² is the product of word and character HMMs of Chen et al. (2014). * marks the accuracy decoded with L = 3; it becomes 81.7 with L = 4, as for MSR and PKU.…”
Section: Dataset
confidence: 99%
“…This means that we want the most "natural" segmentation w, i.e., one that has high probability under a language model p(w|s). Recently, Chen et al. (2014) proposed an intermediate model between heuristic and statistical models, formed as a product of character and word HMMs. However, the two component models share no information with each other, which is not the case with generative models.…”
Section: Introduction
confidence: 99%
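The statistical formulation mentioned in this excerpt, choosing the segmentation w that maximizes p(w|s), can be illustrated with a toy unigram word model and dynamic programming over all segmentations. This is only a sketch of the argmax, not any of the cited models; the lexicon and unknown-word penalty are invented for the example.

```python
import math

def best_segmentation(s: str, word_logprob, max_len: int = 4):
    """Dynamic program over positions: best[i] is the best log-probability
    of any segmentation of s[:i]; back[i] records where the last word of
    that segmentation starts."""
    best = [0.0] + [-math.inf] * len(s)
    back = [0] * (len(s) + 1)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + word_logprob(s[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    # recover the argmax segmentation by walking the backpointers
    words, i = [], len(s)
    while i > 0:
        words.append(s[back[i]:i])
        i = back[i]
    return list(reversed(words))

# Toy lexicon; unknown strings get a heavy per-character penalty.
lex = {"自然": -2.0, "语言": -2.1, "处理": -2.2, "自然语言": -5.0}
lp = lambda w: lex.get(w, -8.0 * len(w))
print(best_segmentation("自然语言处理", lp))  # ['自然', '语言', '处理']
```

The max_len cap plays the same role as the word-length limit L mentioned in the table note above: raising it enlarges the search space of candidate words at each position.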