Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification. Log probability scores computed by these models are then combined, using log-linear combination, with other word scores available in the lattice output of a first-pass recognizer, and the resulting combination score is used to compute a second-pass speech recognition output.
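The log-linear combination step described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the score names (`acoustic`, `lm`, `pron`), the weights, the hypothesis dictionaries, and the score values are hypothetical and do not come from the systems described. The sketch also rescores a flat list of hypotheses rather than a real lattice.

```python
from typing import Dict, List

def combine_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Log-linear combination: a weighted sum of log-domain scores."""
    return sum(weights[name] * scores[name] for name in weights)

def rescore(hypotheses: List[dict], weights: Dict[str, float]) -> dict:
    """Return the hypothesis with the highest combined score."""
    return max(hypotheses, key=lambda h: combine_scores(h["scores"], weights))

# Hypothetical second-pass input: each hypothesis carries first-pass scores
# (acoustic, language model) plus a pronunciation-model log probability.
hypotheses = [
    {"words": "recognize speech",
     "scores": {"acoustic": -120.0, "lm": -8.0, "pron": -15.0}},
    {"words": "wreck a nice beach",
     "scores": {"acoustic": -118.0, "lm": -14.0, "pron": -22.0}},
]

# Hypothetical weights; in practice these would be tuned on held-out data.
weights = {"acoustic": 1.0, "lm": 10.0, "pron": 2.0}

best = rescore(hypotheses, weights)
print(best["words"])
```

Because the scores are log probabilities, the weighted sum corresponds to a product of the underlying model probabilities, each raised to the power of its weight.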
In this paper, we present a methodology for combining acoustic-phonetic knowledge with statistical learning for automatic segmentation and classification of continuous speech. At present we focus on the recognition of five broad classes: vowel, stop, fricative, sonorant consonant, and silence. Judicious use is made of 13 knowledge-based acoustic parameters (APs) and support vector machines (SVMs). It has been shown earlier that SVMs perform comparably to hidden Markov models (HMMs) for detection of stop consonants. On segmentation of continuous speech, we achieve better performance than an HMM-based approach that uses 39 cepstrum-based speech parameters.
A probabilistic framework for a landmark-based approach to speech recognition is presented for obtaining multiple landmark sequences in continuous speech. The landmark detection module uses as input acoustic parameters (APs) that capture the acoustic correlates of some of the manner-based phonetic features. The landmarks include stop bursts, vowel onsets, syllabic peaks and dips, fricative onsets and offsets, and sonorant consonant onsets and offsets. Binary classifiers of the manner phonetic features (syllabic, sonorant, and continuant) are used for probabilistic detection of these landmarks. The probabilistic framework exploits two properties of the acoustic cues of phonetic features: (1) sufficiency of the acoustic cues of a phonetic feature for a probabilistic decision on that feature and (2) invariance of the acoustic cues of a phonetic feature with respect to other phonetic features. Probabilistic landmark sequences are constrained using manner class pronunciation models for isolated word recognition with a known vocabulary. The performance of the system is compared with (1) the same probabilistic system but with mel-frequency cepstral coefficients (MFCCs), (2) a hidden Markov model (HMM) based system using APs, and (3) an HMM-based system using MFCCs.
We propose a method that combines a probabilistic phonetic feature hierarchy with support vector machines for segmentation of continuous speech into five classes: vowel, sonorant consonant, fricative, stop, and silence. We show that by using the hierarchy, only four binary classifiers are required to recognize the five classes. Due to the probabilistic nature of the hierarchy, the method overcomes the disadvantage of traditional acoustic-phonetic methods, in which an error made at one level is carried down the hierarchy. In addition, the hierarchical approach allows each binary classifier to be trained on comparable amounts of data from the two classes it is designed to discriminate. The segmentation method with 13 knowledge-based parameters performs considerably better than a context-independent hidden Markov model (HMM) based approach that uses 39 mel-cepstrum-based parameters.
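One way to see how four binary classifiers can cover five classes is to multiply the branch probabilities along each path of the hierarchy. The sketch below is a hedged illustration only. The tree layout (speech vs. silence, then sonorant, then syllabic or continuant) is an assumption based on the manner features named in these abstracts, and the function name and input probabilities are hypothetical. In a real system the four inputs would come from calibrated classifier outputs.

```python
from typing import Dict

def class_posteriors(p_speech: float, p_sonorant: float,
                     p_syllabic: float, p_continuant: float) -> Dict[str, float]:
    """Combine four binary decisions into five class probabilities.

    Assumed tree (hypothetical layout):
      speech?      no  -> silence
      sonorant?    yes -> syllabic?   yes -> vowel, no -> sonorant consonant
                   no  -> continuant? yes -> fricative, no -> stop
    """
    return {
        "silence": 1.0 - p_speech,
        "vowel": p_speech * p_sonorant * p_syllabic,
        "sonorant consonant": p_speech * p_sonorant * (1.0 - p_syllabic),
        "fricative": p_speech * (1.0 - p_sonorant) * p_continuant,
        "stop": p_speech * (1.0 - p_sonorant) * (1.0 - p_continuant),
    }

# Hypothetical classifier outputs for a single frame.
posteriors = class_posteriors(p_speech=0.9, p_sonorant=0.8,
                              p_syllabic=0.7, p_continuant=0.4)
best_class = max(posteriors, key=posteriors.get)
print(best_class, round(posteriors[best_class], 3))
```

Because every class keeps a nonzero probability, an uncertain decision near the top of the tree lowers the scores of the classes below it without eliminating them, which is the property contrasted above with hard-decision hierarchies.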