Stream weight tuning in dynamic Bayesian networks

Kantor, Arthur; Hasegawa-Johnson, A.

doi:10.1109/icassp.2008.4518662

Cited by 2 publications

(2 citation statements)

References 12 publications


“…The auxiliary function in (15) can now be optimized by separately optimizing for each frame . By applying (1) to (15) we get the following expression for the auxiliary function at each : (19) independent of (20) where and are the single-modality acoustical and visual states composing the coupled state .…”

Section: A Expectation Maximization Algorithm

mentioning

confidence: 99%

“…While in some prior works, the stream weight for the whole dataset has been set to a fixed value, which was found using grid search, e.g., [11], [14] or using other tuning algorithms, e.g., [15], some authors have assumed that the stream weight is a model parameter and have estimated it using generative [16] or discriminative [17], [18] criteria. In real scenarios, however, the reliability of the audio and video modality can vary quickly, even on the frame level, and such fixed or model-dependent estimation might lead to worse results than using Bayes fusion, i.e., equal weights.…”

mentioning

confidence: 99%
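The fixed-weight tuning mentioned in the statement above can be illustrated with a minimal sketch. It assumes the common log-linear fusion rule, in which the audio log-likelihood is scaled by a weight `lam` and the video log-likelihood by `1 - lam`; under that parameterization, equal weights correspond to `lam = 0.5`. The function names and the toy frames are hypothetical and are not taken from any of the cited papers.

```python
# Minimal sketch: grid search for one fixed stream weight (assumed log-linear fusion).

def fuse(log_b_audio, log_b_video, lam):
    """Per-class fused score: lam * audio + (1 - lam) * video log-likelihood."""
    return [lam * a + (1.0 - lam) * v for a, v in zip(log_b_audio, log_b_video)]

def accuracy(frames, lam):
    """Fraction of frames whose best fused class equals the reference label."""
    correct = 0
    for log_a, log_v, label in frames:
        fused = fuse(log_a, log_v, lam)
        predicted = max(range(len(fused)), key=fused.__getitem__)
        correct += predicted == label
    return correct / len(frames)

def grid_search(frames, steps=11):
    """Return the first weight on a uniform grid over [0, 1] with the highest accuracy."""
    grid = [i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: accuracy(frames, lam))

# Hypothetical toy data: (audio log-likelihoods, video log-likelihoods, label).
# The video stream is misleading on the first two frames.
frames = [
    ([-1.0, -3.0], [-2.0, -1.6], 0),
    ([-3.0, -1.0], [-1.6, -2.0], 1),
    ([-1.0, -2.0], [-1.0, -4.0], 0),
]

best_lam = grid_search(frames)
print(best_lam, accuracy(frames, best_lam))
```

A single weight chosen this way stays the same for every frame, which is the limitation the statement points out when modality reliability changes quickly.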


Learning Dynamic Stream Weights For Coupled-HMM-based Audio-visual Speech Recognition

Abdelaziz; Zeiler; Kolossa

2015

IEEE/ACM Trans. Audio Speech Lang. Process.


With the increasing use of multimedia data in communication technologies, the idea of employing visual information in automatic speech recognition (ASR) has recently gathered momentum. In conjunction with the acoustical information, the visual data enhances the recognition performance and improves the robustness of ASR systems in noisy and reverberant environments. In audio-visual systems, dynamic weighting of audio and video streams according to their instantaneous confidence is essential for reliably and systematically achieving high performance. In this paper, we present a complete framework that allows blind estimation of dynamic stream weights for audio-visual speech recognition based on coupled hidden Markov models (CHMMs). As a stream weight estimator, we consider using multilayer perceptrons and logistic functions to map multidimensional reliability measure features to audio-visual stream weights. Training the parameters of the stream weight estimator requires numerous input-output tuples of reliability measure features and their corresponding stream weights. We estimate these stream weights based on oracle knowledge using an expectation maximization algorithm. We define 31-dimensional feature vectors that combine model-based and signal-based reliability measures as inputs to the stream weight estimator. During decoding, the trained stream weight estimator is used to blindly estimate stream weights. The entire framework is evaluated using the Grid audio-visual corpus and compared to state-of-the-art stream weight estimation strategies. The proposed framework significantly enhances the performance of the audio-visual ASR system in all examined test conditions.

Index Terms: audio-visual speech recognition, coupled hidden Markov model, logistic regression, multilayer perceptron, reliability measure, stream weight.
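The idea of mapping reliability features to a per-frame stream weight with a logistic function can be sketched as follows. This is only an illustration under stated assumptions: the two features, the coefficients `COEFFS`, the bias `BIAS`, and the log-linear fusion rule are hypothetical choices made for the example, whereas the abstract describes 31-dimensional features and an estimator trained on EM-derived oracle weights.

```python
import math

# Hypothetical parameters; a trained estimator would learn these from data.
COEFFS = [0.3, 1.5]   # weights for [estimated SNR in dB, audio model confidence]
BIAS = -2.0

def logistic_stream_weight(features, coeffs=COEFFS, bias=BIAS):
    """Map a reliability feature vector to an audio stream weight in (0, 1)."""
    z = bias + sum(c * f for c, f in zip(coeffs, features))
    return 1.0 / (1.0 + math.exp(-z))

def fuse_frame(log_b_audio, log_b_video, lam):
    """Assumed log-linear fusion: lam scales audio, (1 - lam) scales video."""
    return [lam * a + (1.0 - lam) * v for a, v in zip(log_b_audio, log_b_video)]

# Two hypothetical frames: one with clean audio, one with noisy audio.
clean_features = [20.0, 0.9]
noisy_features = [-5.0, 0.2]

lam_clean = logistic_stream_weight(clean_features)
lam_noisy = logistic_stream_weight(noisy_features)

# The same observation scores are fused differently depending on reliability.
log_b_audio, log_b_video = [-1.0, -3.0], [-2.0, -1.6]
print(lam_clean, fuse_frame(log_b_audio, log_b_video, lam_clean))
print(lam_noisy, fuse_frame(log_b_audio, log_b_video, lam_noisy))
```

Because the weight is recomputed for every frame, the audio stream dominates when its reliability features are high and the video stream takes over when they drop.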

