Maximum Mutual Information Estimation of Hidden Markov Models

Abstract. We compare two successful discriminative classification algorithms on three databases from the UCI and STATLOG repositories. The two approaches are the log-linear model for the class posterior probabilities and class-dependent weighted dissimilarity measures for nearest neighbor classifiers. The experiments show that the maximum entropy based log-linear classifier performs better for the equivalent of a single prototype. On the other hand, using multiple prototypes, the weighted dissimilarity measures outperform the log-linear approach. This result suggests an extension of the log-linear method to multiple prototypes.


“…This criterion is often referred to as mutual information criterion in speech recognition, information theory and image object recognition [2,8].…”

Section: Classification Framework (mentioning)

confidence: 99%

Comparison of Log-linear Models and Weighted Dissimilarity Measures

Keysers, Paredes, Vidal, et al. 2003

Pattern Recognition and Image Analysis



“…As for the case of the EB algorithm, the derivatives for reestimation of the mixture weights are replaced by smoothed versions according to (Normandin, 1996).…”

Section: Gradient Descent (mentioning)

confidence: 99%

“…Most applications of discriminative training methods for speech recognition use either the maximum mutual information (MMI) (Bahl et al., 1986; Brown, 1987; Cardin et al., 1993; Chow, 1990; Kapadia et al., 1993; Normandin, 1996; Normandin et al., 1994a,b; Normandin and Morgera, 1991; Reichl and Ruske, 1995; Valtchev et al., 1996, 1997) or the minimum classification error (MCE) (Chou et al., 1992, 1993, 1994; Paliwal et al., 1995; Reichl and Ruske, 1995) criterion. In MCE training, an approximation to the error rate on the training data is optimized, whereas MMI training optimizes the a posteriori probability of the training utterances and hence the class separability.…”

Section: Introduction (mentioning)

confidence: 99%

Comparison of discriminative training criteria and optimization methods for speech recognition

Schlüter, Macherey, Müller, et al. 2001

Speech Communication


The aim of this work is to build up a common framework for a class of discriminative training criteria and optimization methods for continuous speech recognition. A unified discriminative criterion based on likelihood ratios of correct and competing models with optional smoothing is presented. The unified criterion leads to particular criteria through the choice of competing word sequences and the choice of smoothing. Analytic and experimental comparisons are presented for both the maximum mutual information (MMI) and the minimum classification error (MCE) criterion together with the optimization methods gradient descent (GD) and extended Baum (EB) algorithm. A tree search-based restricted recognition method using word graphs is presented, so as to reduce the computational complexity of large vocabulary discriminative training. Moreover, for MCE training, a method using word graphs for efficient calculation of discriminative statistics is introduced. Experiments were performed for continuous speech recognition using the ARPA Wall Street Journal (WSJ) corpus with a vocabulary of 5k words and for the recognition of continuously spoken digit strings using both the TI digit string corpus for American English digits, and the SieTill corpus for telephone line recorded German digits. For the MMI criterion, neither analytical nor experimental results indicate significant differences between EB and GD optimization. For acoustic models of low complexity, MCE training gave significantly better results than MMI training. The recognition results for large vocabulary MMI training on the WSJ corpus show a significant dependence on the context length of the language model used for training. Best results were obtained using a unigram language model for MMI training. No significant correlation has been observed between the language models chosen for training and recognition. © 2001 Elsevier Science B.V. All rights reserved.


“…The main idea behind Discriminative training (DT) is to introduce a discriminative criterion to the training method of Hidden Markov Models (HMMs). Several discriminative training methods have been proposed for ASR, such as maximum mutual information estimation (MMIE) [2,3,4], minimum classification error (MCE) [5,6,7]; and minimum word/phone error (MWE/MPE) [8,9]. For Hidden Markov (HMM) based speech recognition, conventional discriminative training criterions directly minimize the empirical risk on the training data sample and do not focus on the model generalization.…”

Section: Introduction (mentioning)

confidence: 99%