In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference. Our model is a nested hierarchical Pitman-Yor language model in which a Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previously reported results on both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation. Our model can also be seen as a way to construct an accurate word n-gram language model directly from the characters of an arbitrary language, without any "word" annotations.
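The blocked sampler couples Gibbs sampling with dynamic programming: a forward pass accumulates the total probability of every prefix, then a backward pass samples word boundaries from the exact conditional posterior. The sketch below illustrates this forward-filtering backward-sampling idea only; `word_prob` is a hypothetical stand-in for the nested Pitman-Yor word model, not the actual model.

```python
import random

def word_prob(word):
    """Toy stand-in for the nested Pitman-Yor word model: a simple
    geometric length penalty (hypothetical, for illustration only)."""
    return 0.5 ** len(word)

def sample_segmentation(s, max_len=8, rng=random):
    """Forward-filtering backward-sampling over segmentations of s.
    alpha[t] = total probability of generating s[:t] as a word sequence."""
    n = len(s)
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):
            alpha[t] += word_prob(s[t - k:t]) * alpha[t - k]
    # Backward pass: sample each word length proportional to its mass.
    words, t = [], n
    while t > 0:
        ks = list(range(1, min(max_len, t) + 1))
        weights = [word_prob(s[t - k:t]) * alpha[t - k] for k in ks]
        k = rng.choices(ks, weights=weights)[0]
        words.append(s[t - k:t])
        t -= k
    return list(reversed(words))
```

Sampling whole segmentations at once (rather than flipping one boundary at a time) is what makes the sampler "blocked" and is key to fast mixing.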
The aim of this work is to apply a sampling approach to speech modeling, and we propose a Gibbs-sampling-based Multi-scale Mixture Model (M3). The proposed approach focuses on the multi-scale property of speech dynamics, i.e., dynamics in speech can be observed on, for instance, short-time acoustic, linguistic-segmental, and utterance-wise temporal scales. M3 is an extension of the Gaussian mixture model and can be considered a hierarchical mixture model in which the mixture components at each time scale change at intervals of the corresponding time unit. We derive a fully Bayesian treatment of the multi-scale mixture model based on Gibbs sampling. The advantage of the proposed model is that each speaker cluster can be precisely modeled with a Gaussian mixture model, unlike conventional single-Gaussian speaker clustering (e.g., using the Bayesian Information Criterion (BIC)). In addition, Gibbs sampling offers the potential to avoid serious local-optimum problems. Speaker clustering experiments confirmed these advantages and showed a significant improvement over conventional BIC-based approaches.
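The core inference step in such Gibbs-sampled mixture models is resampling each observation's cluster assignment given all the others. A minimal one-dimensional sketch, assuming fixed component means and a symmetric Dirichlet prior on weights (not the actual M3 model, which layers mixtures across time scales):

```python
import math
import random

def gibbs_sweep(data, z, means, var=1.0, alpha=1.0, rng=random):
    """One Gibbs sweep over cluster assignments z: each point is removed
    from its cluster, then reassigned proportional to (prior count) x
    (Gaussian likelihood), with weights collapsed out under a symmetric
    Dirichlet(alpha/K) prior."""
    K = len(means)
    counts = [0] * K
    for zi in z:
        counts[zi] += 1
    for i, x in enumerate(data):
        counts[z[i]] -= 1  # remove point i from its current cluster
        weights = []
        for k in range(K):
            prior = counts[k] + alpha / K
            lik = math.exp(-(x - means[k]) ** 2 / (2.0 * var))
            weights.append(prior * lik)
        z[i] = rng.choices(range(K), weights=weights)[0]
        counts[z[i]] += 1
    return z
```

Because assignments are resampled stochastically rather than greedily, the chain can escape poor initializations, which is the local-optimum advantage the abstract refers to.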
We propose a nonparametric Bayesian model for joint unsupervised word segmentation and part-of-speech tagging from raw strings. Extending a previous model for word segmentation, our model, called a Pitman-Yor Hidden Semi-Markov Model (PYHSMM), can be considered a method to build a class n-gram language model directly from strings while integrating character- and word-level information. Experimental results on standard datasets for Japanese, Chinese, and Thai show that it outperforms previous results and yields state-of-the-art accuracies. This model can also serve to analyze the structure of a language whose words are not identified a priori.
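A hidden semi-Markov model runs inference over word boundaries and tags jointly: the forward lattice sums over every (word length, previous tag) pair at each position. A toy sketch of that forward pass, with `emit` and `trans` as hypothetical placeholder distributions (the actual PYHSMM puts hierarchical Pitman-Yor priors on both):

```python
def forward_lattice(s, n_tags, emit, trans, max_len=4):
    """Semi-Markov forward pass: alpha[t][z] is the total probability of
    segmentations of s[:t] whose last word ends at t and carries tag z.
    emit(word, tag) and trans(prev_tag, tag) are caller-supplied."""
    n = len(s)
    alpha = [[0.0] * n_tags for _ in range(n + 1)]
    start = [1.0 / n_tags] * n_tags  # uniform initial tag distribution
    for t in range(1, n + 1):
        for z in range(n_tags):
            for k in range(1, min(max_len, t) + 1):
                w = s[t - k:t]
                if t - k == 0:
                    alpha[t][z] += emit(w, z) * start[z]
                else:
                    alpha[t][z] += emit(w, z) * sum(
                        trans(zp, z) * alpha[t - k][zp]
                        for zp in range(n_tags))
    return alpha
```

Sampling a segmentation-plus-tagging then walks this lattice backwards, exactly as in the word-only case but drawing a (length, tag) pair at each step.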