“…Looking for alternatives, extensive studies have been carried out towards the modelling of the spectral dynamics of speech in an HMM framework. These studies can be traced, typically, from the bigram or N-gram constrained HMM (Wellekens, 1987; Paliwal, 1993; Takahashi, Matsuoka, Minami & Shikano, 1993; Nakagawa & Yamamoto, 1996), to the linear-predictive HMM (Kenny, Lennig & Mermelstein, 1990; Woodland, 1992), and to a large family of segment-based HMMs (see, for example, Brown, 1987; Russell, 1993; Bahl, de Souza, Gopalakrishnan, Nahamoo & Picheny, 1994; Milner & Vaseghi, 1994; Zavaliagkos, Zhao, Schwartz & Makhoul, 1994; Nakagawa & Yamamoto, 1996). In the bigram, N-gram or linear-predictive HMMs, each observed frame is made to be explicitly dependent on one or more previous frames by using some type of conditional observation densities; and in the segment-based HMMs, segments of frames are handled by using a neural net (Zavaliagkos et al., 1994), by assuming some type of conditional independence (Russell, 1993), or by performing some transformations to reduce the data size and/or to achieve some robustness of the resulting parameters (Bahl et al., 1994; Milner & Vaseghi, 1994; Nakagawa & Yamamoto, 1996).…”
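
As an illustration of the conditional observation densities mentioned above, the sketch below evaluates a Gaussian state density whose mean is shifted by a linear prediction from the previous frame. The function name, the parameter names and the first-order Gaussian form are assumptions made for illustration only; they are not taken from any of the cited works, whose formulations differ in detail.

```python
import numpy as np


def conditional_log_density(o_t, o_prev, mean, pred, cov):
    """Log-density of frame o_t given the previous frame o_prev for one state.

    Assumed illustrative form (hypothetical, not from the cited papers):
        o_t | o_prev ~ N(mean + pred @ (o_prev - mean), cov)
    where `mean` is the state mean, `pred` a prediction matrix and
    `cov` the covariance of the prediction residual.
    """
    o_t = np.asarray(o_t, dtype=float)
    o_prev = np.asarray(o_prev, dtype=float)
    mean = np.asarray(mean, dtype=float)
    pred = np.asarray(pred, dtype=float)
    cov = np.asarray(cov, dtype=float)

    # Conditional mean: the state mean corrected by the previous frame.
    cond_mean = mean + pred @ (o_prev - mean)
    diff = o_t - cond_mean
    dim = diff.shape[0]

    # Log-determinant and Mahalanobis term without forming the inverse.
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (dim * np.log(2.0 * np.pi) + logdet + maha)


if __name__ == "__main__":
    mean = np.zeros(2)
    cov = np.eye(2)
    # With a zero prediction matrix the previous frame has no effect.
    print(conditional_log_density([0.0, 0.0], [5.0, 5.0], mean, np.zeros((2, 2)), cov))
    # With a non-zero prediction matrix the expected frame follows o_prev.
    print(conditional_log_density([1.0, 1.0], [2.0, 2.0], mean, 0.5 * np.eye(2), cov))
```

Setting `pred` to the zero matrix recovers the ordinary frame-independent Gaussian output density of a standard HMM, which makes the added dependence on the previous frame easy to isolate.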