For large vocabulary and continuous speech recognition, the subword-unit-based approach is a viable alternative to the whole-word-unit-based approach. For preparing a large inventory of subword units, automatic segmentation is preferable to manual segmentation, as it substantially reduces the work associated with the generation of templates and gives more consistent results. In this paper we discuss some methods for automatically segmenting speech into phonetic units. Three different approaches are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on a constrained-clustering vector quantization approach. An evaluation of the performance of the automatic segmentation methods is given.
I INTRODUCTION

Although a word-based (WB) approach to speech recognition is popular for its simplicity of implementation and for its good performance on small-to-medium-size vocabulary, isolated word recognition tasks, the approach cannot be easily extended to large vocabulary and/or continuous speech recognition applications. For large vocabulary, isolated word recognition, a large amount of training data, proportional to the vocabulary size, N, is needed to characterize each individual word model. In continuous speech recognition, the amount of training data needed for characterizing the word junctures is even greater, i.e., on the order of N^2. To overcome the problem of the training data size, a subword unit, segment-based (SB) approach, where different words can share common segments in their representations, is a more viable alternative than the WB approach. However, preparing a subword segment inventory of a reasonable size, say, 200-1000 entries, is not a trivial task. Manual segmentation can be used, but it has two major drawbacks: i) The process is both laborious and tedious, requiring, for example, extensive listening and spectrogram interpretation. ii) Due to the subjective nature of manual segmentation, there will be inconsistencies from trial to trial, even when segmenting the same utterance. To alleviate these problems, automatic procedures for segmenting speech into subword units are investigated in this paper.

Over the past years, several procedures for automatic segmentation of speech have been proposed in the literature. Most of the procedures have followed one of two basic approaches to the problem. The first approach is to utilize explicit information that is known a priori, such as the correct phonetic transcription of the utterance. The incoming speech signal is then segmented using reference templates corresponding to the phonetic events. Variations of this approach are described in references [1] and [2].
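
The first approach can be illustrated with a minimal sketch. It is not the procedure of [1] or [2], only a simplified dynamic-programming alignment under stated assumptions: each phonetic event in the known transcription is represented by a single reference vector, frames are compared to templates with a squared Euclidean distance, and every template must cover at least one frame. The names `sq_dist` and `align_to_templates` are hypothetical and chosen for illustration.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def align_to_templates(frames, templates):
    """Segment `frames` against an ordered list of reference `templates`.

    `templates` follows the known phonetic transcription, one vector per
    phonetic event (a simplifying assumption). Returns the index of the first
    frame assigned to each template, so the entries are segment boundaries.
    """
    T, K = len(frames), len(templates)
    if K == 0 or T < K:
        raise ValueError("need at least one frame per template")

    # cost[k][t]: lowest total distance for frames[0..t] when frame t is
    # assigned to template k and templates are visited in order.
    cost = [[math.inf] * T for _ in range(K)]
    # advanced[k][t]: True if frame t is the first frame of template k
    # on the best path reaching (k, t).
    advanced = [[False] * T for _ in range(K)]

    cost[0][0] = sq_dist(frames[0], templates[0])
    for t in range(1, T):
        for k in range(min(K, t + 1)):
            stay = cost[k][t - 1]  # remain in the same template
            move = cost[k - 1][t - 1] if k > 0 else math.inf  # enter template k
            if move < stay:
                best = move
                advanced[k][t] = True
            else:
                best = stay
            cost[k][t] = best + sq_dist(frames[t], templates[k])

    # Backtrack from the last template at the last frame.
    starts = [0] * K
    k = K - 1
    for t in range(T - 1, 0, -1):
        if advanced[k][t]:
            starts[k] = t
            k -= 1
    return starts
```

For example, with the one-dimensional frames `[[0.0], [0.1], [0.0], [1.0], [0.9], [2.0], [2.1]]` and the templates `[[0.0], [1.0], [2.0]]`, the function returns `[0, 3, 5]`: the second segment begins at frame 3 and the third at frame 5. Forcing the alignment to follow the transcription order is what distinguishes this kind of approach from one that uses only the acoustic signal.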
The second approach does not require any explicit information, but utilizes only the acoustical information that is contained within the speech signal to be segmented, such as the amount of spectral change from one speech frame to the next. Atal [3] describes a method for t ...