We report on investigations, conducted at the 2006 Johns Hopkins Workshop, into the use of articulatory features (AFs) for observation and pronunciation models in speech recognition. In the area of observation modeling, we use the outputs of AF classifiers both directly, in an extension of hybrid HMM/neural network models, and as part of the observation vector, an extension of the "tandem" approach. In the area of pronunciation modeling, we investigate a model having multiple streams of AF states with soft synchrony constraints, for both audio-only and audio-visual recognition. The models are implemented as dynamic Bayesian networks, and tested on tasks from the Small-Vocabulary Switchboard (SVitchboard) corpus and the CUAVE audio-visual digits corpus. Finally, we analyze AF classification and forced alignment using a newly collected set of feature-level manual transcriptions.
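As background for the "direct" use of AF classifier outputs in an extension of hybrid HMM/neural network models, the sketch below shows the standard scaled-likelihood conversion used in hybrid systems, applied to one articulatory feature stream. The function name, array shapes, and the commented classifier call are illustrative assumptions, not the workshop systems' code.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """Convert per-frame classifier posteriors P(value | x_t) into scaled
    likelihoods proportional to P(x_t | value), i.e. P(value | x_t) / P(value),
    the usual observation score in hybrid HMM/neural-network systems.

    posteriors: (T, K) per-frame posteriors over K feature values
    priors:     (K,) feature-value priors estimated from training labels
    returns:    (T, K) log scaled likelihoods
    """
    posteriors = np.clip(posteriors, floor, 1.0)
    priors = np.clip(priors, floor, 1.0)
    return np.log(posteriors) - np.log(priors)

# Illustrative use for one articulatory feature stream (e.g., voicing):
# post = voicing_mlp(acoustic_frames)            # hypothetical classifier, shape (T, 3)
# loglik = scaled_log_likelihoods(post, priors)  # plug into the HMM/DBN as observation scores
```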
The so-called tandem approach, in which the posteriors of a multilayer perceptron (MLP) classifier are used as features in an automatic speech recognition (ASR) system, has proven to be a very effective method. Most tandem approaches to date have relied on MLPs trained for phone classification, and have appended the posterior features to standard acoustic features in a hidden Markov model (HMM) system. In this paper, we develop an alternative tandem approach based on MLPs trained for articulatory feature (AF) classification. We also develop a factored observation model for characterizing the posterior and standard features at the HMM outputs, allowing for separate hidden mixture and state-tying structures for each factor. In experiments on a subset of Switchboard, we show that the AF-based tandem approach is as effective as the phone-based approach, and that the factored observation model significantly outperforms simple feature concatenation while using fewer parameters.
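To make the tandem recipe concrete, here is a minimal sketch of the usual pipeline: take log MLP posteriors, decorrelate and reduce them, and append them to the standard features frame by frame. PCA stands in here for the Karhunen-Loève transform commonly used, and the array shapes and projection dimension are assumptions rather than the paper's settings; for the AF-based variant, the posteriors of the several AF MLPs would be concatenated before this step.

```python
import numpy as np

def tandem_features(mlp_posteriors, standard_feats, n_components=25, floor=1e-8):
    """Illustrative tandem feature construction.

    mlp_posteriors: (T, K) posteriors from the phone or AF classifier
    standard_feats: (T, D) standard acoustic features, e.g. MFCCs or PLPs
    returns:        (T, D + n_components) concatenated feature vectors
    """
    logp = np.log(np.clip(mlp_posteriors, floor, 1.0))
    logp -= logp.mean(axis=0)                 # center before decorrelation
    _, _, vt = np.linalg.svd(logp, full_matrices=False)
    reduced = logp @ vt[:n_components].T      # project onto top components
    return np.hstack([standard_feats, reduced])
```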
In economics and political science, the theoretical literature on social choice routinely highlights worst-case scenarios and emphasizes the nonexistence of a universally best voting method. Behavioral social choice is grounded in psychology and tackles consensus methods descriptively and empirically. We analyzed four elections of the American Psychological Association using a state-of-the-art multimodel, multimethod approach. These elections provide rare access to (likely sincere) preferences of large numbers of decision makers over five choice alternatives. We determined the outcomes according to three classical social choice procedures: Condorcet, Borda, and plurality. Although the literature routinely depicts these procedures as irreconcilable, we found strong statistical support for an unexpected degree of empirical consensus among them in these elections. Our empirical findings stand in contrast to two centuries of pessimistic thought experiments and computer simulations in social choice theory and demonstrate the need for more systematic descriptive and empirical research on social choice than exists to date.
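For readers unfamiliar with the three procedures, the toy preference profile below (made up for illustration, not the APA data) shows how plurality, Borda, and Condorcet winners are computed, and how they can disagree.

```python
from collections import Counter

def plurality_winner(ballots):
    """Winner by first-place votes; each ballot ranks candidates, best first."""
    return Counter(b[0] for b in ballots).most_common(1)[0][0]

def borda_winner(ballots):
    """Each ballot gives m-1 points to its top choice, m-2 to the next, and so on."""
    scores = Counter()
    for b in ballots:
        m = len(b)
        for rank, cand in enumerate(b):
            scores[cand] += m - 1 - rank
    return scores.most_common(1)[0][0]

def condorcet_winner(ballots, candidates):
    """Candidate who beats every other in pairwise majority contests, if one exists."""
    def beats(a, b):
        return sum(bal.index(a) < bal.index(b) for bal in ballots) > len(ballots) / 2
    for c in candidates:
        if all(beats(c, d) for d in candidates if d != c):
            return c
    return None  # no Condorcet winner (a majority cycle)

# Toy profile over candidates A, B, C: 4 voters rank A>B>C, 3 rank B>C>A, 2 rank C>B>A.
ballots = [("A", "B", "C")] * 4 + [("B", "C", "A")] * 3 + [("C", "B", "A")] * 2
print(plurality_winner(ballots))          # A (most first-place votes)
print(borda_winner(ballots))              # B
print(condorcet_winner(ballots, "ABC"))   # B (beats both A and C pairwise)
```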
In recent years, the features derived from posteriors of a multilayer perceptron (MLP), known as tandem features, have proven to be very effective for automatic speech recognition. Most tandem features to date have relied on MLPs trained for phone classification. We recently showed on a relatively small data set that MLPs trained for articulatory feature classification can be equally effective. In this paper, we provide a similar comparison using MLPs trained on a much larger data set of 2000 hours of English conversational telephone speech. We also explore how portable phone- and articulatory feature-based tandem features are in an entirely different language, Mandarin, without any retraining. We find that while the phone-based features perform slightly better in the matched-language condition, they perform significantly better in the cross-language condition. Yet, in the cross-language condition, neither approach is as effective as the tandem features extracted from an MLP trained on a relatively small amount of in-domain data. Beyond feature concatenation, we also explore novel observation modeling schemes that allow for greater flexibility in combining the tandem and standard features at hidden Markov model (HMM) outputs.
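As an illustration of what a factored observation model can look like, the sketch below scores a frame as a product of two factors, one over the standard features and one over the tandem features, each with its own GMM and its own state-tying map. The parameter layout and function signature are assumptions, and factor weights or exponents are omitted for brevity.

```python
import numpy as np
from scipy.stats import multivariate_normal

def factored_log_likelihood(x_std, x_tnd, state,
                            std_tying, tnd_tying, std_gmms, tnd_gmms):
    """Illustrative factored observation model: the state output density is a
    product of a standard-feature factor and a tandem-feature factor, each
    with its own GMM and its own state-tying structure.

    std_tying / tnd_tying: dicts mapping an HMM state to a tied density index
    std_gmms / tnd_gmms:   lists of (weights, means, covs) GMM parameters
    """
    def gmm_logpdf(x, gmm):
        w, mu, cov = gmm
        comps = [np.log(wk) + multivariate_normal.logpdf(x, m, c)
                 for wk, m, c in zip(w, mu, cov)]
        return np.logaddexp.reduce(comps)

    return (gmm_logpdf(x_std, std_gmms[std_tying[state]]) +
            gmm_logpdf(x_tnd, tnd_gmms[tnd_tying[state]]))
```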
In this paper we present a family of algorithms for estimating stream weights for dynamic Bayesian networks with multiple observation streams. For the two-stream case, we present a weight-tuning algorithm that is optimal in the minimum classification error (MCE) sense. We compare the algorithms to brute-force search where feasible, as well as to previously published algorithms, and show that they perform as well as brute-force search and outperform the previously published algorithms. We test the stream-weight tuning algorithm in the context of speech recognition with distinctive feature tandem models. We also analyze how the criterion used for weight tuning differs from the standard word error rate criterion used in speech recognition.
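For concreteness, the sketch below shows the brute-force baseline that such weight-tuning algorithms are compared against: a grid search over the two-stream weight that minimizes classification error on held-out data. The array shapes and variable names are assumptions, and this is not the paper's MCE-optimal algorithm itself.

```python
import numpy as np

def tune_stream_weight(loglik1, loglik2, labels, n_grid=101):
    """Brute-force two-stream weight tuning: score each class with
    w * loglik1 + (1 - w) * loglik2 and keep the weight that minimizes
    classification error on held-out examples.

    loglik1, loglik2: (N, C) per-example, per-class log-likelihoods of the two streams
    labels:           (N,) true class indices
    """
    best_w, best_err = 0.5, np.inf
    for w in np.linspace(0.0, 1.0, n_grid):
        pred = np.argmax(w * loglik1 + (1.0 - w) * loglik2, axis=1)
        err = np.mean(pred != labels)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err
```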