Conversational speech recognition is a challenging problem primarily because speakers rarely fully articulate sounds. A successful speech recognition approach must infer intended spectral targets from the speech data, or develop a method of dealing with large variances in the data. Hidden Dynamic Models (HDMs) attempt to automatically learn such targets in a hidden feature space using models that integrate linguistic information with constrained temporal trajectory models. HDMs are a radical departure from conventional hidden Markov models (HMMs), which simply account for variation in the observed data. In this paper, we present an initial evaluation of such models on a conversational speech recognition task involving a subset of the SWITCHBOARD corpus. We show that in an N-Best rescoring paradigm, HDMs are capable of delivering performance competitive with HMMs.

1. INTRODUCTION

Hidden dynamic models [1,2] (HDMs) attempt to produce acoustic likelihoods of phone-level sound units that reflect intended spectral configurations rather than likelihoods based on the actual realization of the sound in the speech data. This is a radical departure from current statistical modeling approaches that attempt to account for variation in the data by accumulating large numbers of Gaussian mixture components. It is conjectured that this approach will produce more consistent acoustic scoring for conversational speech, because sounds are rarely fully articulated in such data. Tremendous amounts of variation are observed in the speech data because the manner in which the realization of a sound is truncated is highly context-dependent. It is the goal of this work to produce acoustic scores that reflect measurements in the hidden (or target) space, rather than directly in the feature space as is currently done in context-dependent phonetic modeling.

The work presented here was the culmination of an intense effort at the 1998 NSF Workshop on Language Engineering held at the Center for Language and Speech Processing at Johns Hopkins University. One goal of this work, which is the primary focus of this paper, was to evaluate the HDM approach on a credible conversational speech recognition task involving the SWITCHBOARD (SWB) Corpus [3].

HIDDEN DYNAMIC MODELS