Noise Adaptive Stream Weighting in Audio-Visual Speech Recognition

Heckmann, Martin; Berthommier, Frédéric; Kroschel, Kristian

doi:10.1155/s1110865702206150

Cited by 85 publications

(85 citation statements)

References 30 publications


“…However, this is insufficient, since attaining optimal performance requires that we dynamically adjust the share of each stream in the decision process, e.g., to account for visual tracking failures in the AV-ASR case. There have been some efforts towards dynamically adjustable stream weights, as well as stream weights adapted to the phonemic content of audiovisual speech (in the form of unit- or even class-dependent stream weights) [30]-[32]; however, stream weight tuning in this context is challenging, typically requiring extensive training sets.…”

Section: A Stream Weights In Multimodal Fusion (mentioning)

confidence: 99%

Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition

Papandreou

Katsamanis

Pitsikalis

et al. 2009

IEEE Trans. Audio Speech Lang. Process.


Abstract: While the accuracy of feature measurements heavily depends on changing environmental conditions, studying the consequences of this fact in pattern recognition tasks has received relatively little attention to date. In this paper, we explicitly take feature measurement uncertainty into account and show how multimodal classification and learning rules should be adjusted to compensate for its effects. Our approach is particularly fruitful in multimodal fusion scenarios, such as audiovisual speech recognition, where multiple streams of complementary time-evolving features are integrated. For such applications, provided that the measurement noise uncertainty for each feature stream can be estimated, the proposed framework leads to highly adaptive multimodal fusion rules which are easy and efficient to implement. Our technique is widely applicable and can be transparently integrated with either synchronous or asynchronous multimodal sequence integration architectures. We further show that multimodal fusion methods relying on stream weights can naturally emerge from our scheme under certain assumptions; this connection provides valuable insights into the adaptivity properties of our multimodal uncertainty compensation approach. We show how these ideas can be practically applied for audiovisual speech recognition. In this context, we propose improved techniques for person-independent visual feature extraction and uncertainty estimation with active appearance models, and also discuss how enhanced audio features along with their uncertainty estimates can be effectively computed. We demonstrate the efficacy of our approach in audiovisual speech recognition experiments on the CUAVE database using either synchronous or asynchronous multimodal integration models.

Index Terms: Active appearance models (AAMs), audiovisual automatic speech recognition (AV-ASR), multimodal fusion, uncertainty compensation.
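The stream-weight fusion that this abstract and the citation statement refer to is commonly written as a weighted sum of per-stream log-likelihoods. The sketch below illustrates that generic form only. It is not the uncertainty-compensation method of the cited paper, and the function name, the single `audio_weight` parameter, and the example likelihood values are all assumptions chosen for illustration.

```python
import numpy as np

def fuse_log_likelihoods(audio_ll, video_ll, audio_weight):
    """Combine per-class log-likelihoods from two streams.

    audio_weight is the (assumed) audio stream weight in [0, 1];
    the video stream receives the complementary weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    audio_ll = np.asarray(audio_ll, dtype=float)
    video_ll = np.asarray(video_ll, dtype=float)
    return audio_weight * audio_ll + (1.0 - audio_weight) * video_ll

# Hypothetical per-class likelihoods for three classes.
audio_ll = np.log([0.2, 0.5, 0.3])  # audio stream favours class 1
video_ll = np.log([0.7, 0.2, 0.1])  # video stream favours class 0

# Trust audio when it is assumed clean, video when audio is assumed noisy.
clean = fuse_log_likelihoods(audio_ll, video_ll, audio_weight=0.9)
noisy = fuse_log_likelihoods(audio_ll, video_ll, audio_weight=0.1)

print(int(np.argmax(clean)), int(np.argmax(noisy)))
```

Making the weight depend on the estimated acoustic conditions, rather than fixing it once, is the kind of dynamic adjustment the citation statement argues for.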



“…In contrast to the fusion of previous independent processing of each modality [1], the integration could occur at the feature level. In this case audio and video features are concatenated into larger feature-vectors, which are then processed by a single algorithm.…”

Section: Choosing a Fusion Space (mentioning)

confidence: 99%

Detection and localization of 3D audio-visual objects using unsupervised clustering

Khalidov

Forbes

Hansard

et al. 2008

Proceedings of the 10th International Conference on Multimodal Interfaces


This paper addresses the issues of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations. It is shown that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data into a common audio-visual 3D representation via a pair of mixture models. Inference is performed by a version of the expectation-maximization algorithm, which is formally derived, and which provides cooperative estimates of both the auditory activity and the 3D position of each object. We describe several experiments with single- and multiple-speaker detection and localization, in the presence of other audio sources.


“…Multiple data streams may be from different sensory modalities, e.g. video and audio [2], or from different representations of the same input stream, such as analysis on different time scales [3], or static and time difference features as used in this paper. We are working with the full-combination multi-stream (FCMS) HMM/ANN approach for noise robust ASR, whose superiority was shown in [3].…”

Section: Introduction (mentioning)

confidence: 99%

New entropy based combination rules in HMM/ANN multi-stream ASR

Misra

Bourlard

Tyagi

2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).


Classifier performance is often enhanced through combining multiple streams of information. In the context of multi-stream HMM/ANN systems in ASR, a confidence measure widely used in classifier combination is the entropy of the posterior distribution output from each ANN, which generally increases as classification becomes less reliable. The rule most commonly used is to select the ANN with the minimum entropy. However, this is not necessarily the best way to use entropy in classifier combination. In this article, we test three new entropy-based combination rules in a full-combination multi-stream HMM/ANN system for noise robust speech recognition. Best results were obtained by combining all the classifiers having entropy below average using a weighting proportional to their inverse entropy.
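The best-performing rule named in this abstract (keep the classifiers whose entropy is below average, then weight them in proportion to inverse entropy) can be sketched roughly as follows. This is a reading of the abstract's one-sentence description and not the authors' implementation. The function names, the use of `<=` so that at least one classifier is always kept, the `eps` guard against division by zero, and the linear combination of posteriors are all assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of one posterior distribution."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # skip zero entries, treating 0 * log(0) as 0
    return float(-np.sum(nz * np.log(nz)))

def combine_below_average_inverse_entropy(posteriors, eps=1e-12):
    """Combine classifier posteriors, shape (n_classifiers, n_classes).

    Classifiers with entropy above the average get weight zero; the rest
    get weights proportional to 1 / entropy, normalised to sum to one.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    h = np.array([entropy(p) for p in posteriors])
    keep = h <= h.mean()  # assumption: <= keeps at least one classifier
    weights = np.where(keep, 1.0 / (h + eps), 0.0)
    weights = weights / weights.sum()
    return weights @ posteriors, weights

# Hypothetical posteriors from four classifiers over three classes.
posteriors = [
    [0.90, 0.05, 0.05],  # confident, low entropy
    [0.70, 0.20, 0.10],  # moderately confident
    [0.40, 0.30, 0.30],  # nearly uninformative
    [0.34, 0.33, 0.33],  # almost uniform, high entropy
]
combined, weights = combine_below_average_inverse_entropy(posteriors)
print(weights.round(3), combined.round(3))
```

In this example the two high-entropy classifiers are dropped, and the most confident classifier receives the largest share of the remaining weight.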

