A boundary refinement method using a new hidden Markov model (HMM) topology is proposed for automatic phonetic speech segmentation. The proposed method can operate at high frame rates, and its training and boundary refinement stages are simple and fast. The method is data driven and can be adapted to any speech segmentation problem provided that a training set is available. Given an initial segmentation obtained by forced alignment using an HMM based phone recogniser, a 20% decrease in boundary errors is achieved.

Introduction: Boundary refinement aims to improve the precision of phonetic boundary locations in a speech waveform by using the boundary locations estimated by an automatic speech segmentation (AS) system, together with acoustical and statistical knowledge about speech. Hidden Markov model (HMM) based speech recognisers are used for AS. They work at a frame rate of 100 frame/s, which is low relative to the rate needed for the required segmentation accuracy (200 to 1000 frame/s). The same holds for AS systems that are not HMM based. Therefore, two-stage approaches are widely used in the literature. The boundaries obtained after the first stage have very few gross errors and many fine errors owing to poor time resolution. The refinement process has to decrease the magnitudes of the small errors without giving rise to additional large errors.

Several approaches to boundary refinement exist in the literature. In [1], average deviations from the hand labelled boundaries are calculated for different boundary classes, and the boundaries from the first stage are shifted by the boundary-specific average deviation. A context dependent approach [2] uses boundary models composed of a fixed length sequence of Gaussian mixture models (GMMs) for every phoneme pair. The refined boundary is then searched for around the boundary point estimated in the first stage so as to maximise its likelihood given the model.
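As a rough illustration of the average-deviation correction attributed to [1] above, the following minimal sketch estimates one mean offset per boundary class from hand labelled data and applies it to first-stage boundaries. The function names, the tuple layout, the sign convention (hand labelled time minus automatic time) and the choice to leave unseen classes unshifted are assumptions made for illustration; they are not taken from [1].

```python
from collections import defaultdict


def mean_deviation_per_class(training):
    """Estimate the average boundary deviation for each boundary class.

    `training` is an iterable of (boundary_class, auto_time_s, hand_time_s)
    tuples (hypothetical layout). The deviation is defined here as
    hand_time_s - auto_time_s, which is an assumed sign convention.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for boundary_class, auto_time_s, hand_time_s in training:
        sums[boundary_class] += hand_time_s - auto_time_s
        counts[boundary_class] += 1
    return {cls: sums[cls] / counts[cls] for cls in sums}


def refine(boundaries, mean_dev):
    """Shift each first-stage boundary by its class-specific mean deviation.

    `boundaries` is a list of (boundary_class, auto_time_s) tuples.
    Classes absent from `mean_dev` are left unshifted (an assumption).
    """
    return [(cls, t + mean_dev.get(cls, 0.0)) for cls, t in boundaries]
```

A correction of this kind can only remove a systematic bias per class; it does not use the acoustics around an individual boundary, which is what the model based approaches aim to exploit.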
Another method aims to minimise audible signal discontinuities caused by spectral mismatches when units are concatenated [3]. The weighted spectral slope metric [4] is adapted to find the boundary as the point at which the spectral discontinuity is maximum. The search interval for the maximisation is determined according to the boundary class. A more comprehensive work [5] builds an artificial neural network (ANN) boundary model for the second stage. This model uses statistical information, such as the average durations of the phones in the database and the probability distribution function of the boundary around the boundary found at the first stage, as well as acoustic features such as energy, correlation and the log-energy spectrum of the signal.

In this Letter, a boundary refinement method based on a new HMM topology is presented. Preliminary work tested on only two phoneme-to-phoneme boundaries was presented (in Turkish) at a local conference [6]. The work described here involves improved training and test stages, a new boundary-phoneme-class based approach to apply the boundary refinement to all phoneme pairs, and ...