2002
DOI: 10.1155/s1110865702206150

Noise Adaptive Stream Weighting in Audio-Visual Speech Recognition

Abstract:

It has been shown that integration of acoustic and visual information especially in noisy conditions yields improved speech recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this paper we develop a weighting process adaptive to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Mar…

Cited by 85 publications
(85 citation statements)
References 30 publications
“…However, this is insufficient, since attaining optimal performance requires that we dynamically adjust the share of each stream in the decision process, e.g., to account for visual tracking failures in the AV-ASR case. There have been some efforts towards dynamically adjustable stream weights, as well as stream weights adapted to the phonemic content of audiovisual speech (in the form of unit- or even class-dependent stream weights) [30]-[32]; however, stream weight tuning in this context is challenging, typically requiring extensive training sets.…”
Section: A Stream Weights In Multimodal Fusion (mentioning)
confidence: 99%
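
To make the idea of dynamically adjusted stream weights in the statement above concrete, here is a minimal sketch, not the method of the cited paper or of the citing one. It assumes each stream supplies per-class log-likelihoods and that the audio weight is derived from an estimated signal-to-noise ratio. The function names, the linear SNR-to-weight mapping, and the 0 dB and 20 dB limits are all hypothetical choices made for illustration.

```python
def audio_weight_from_snr(snr_db, low_db=0.0, high_db=20.0):
    """Hypothetical mapping: linear in SNR (dB), clamped to [0, 1]."""
    t = (snr_db - low_db) / (high_db - low_db)
    return min(1.0, max(0.0, t))


def fuse_log_likelihoods(audio_ll, video_ll, audio_weight):
    """Weighted sum of per-class log-likelihoods from the two streams.

    The video stream receives the complementary weight (1 - audio_weight).
    """
    return {
        label: audio_weight * audio_ll[label]
        + (1.0 - audio_weight) * video_ll[label]
        for label in audio_ll
    }


def decide(audio_ll, video_ll, snr_db):
    """Pick the class with the highest fused score at the given SNR."""
    weight = audio_weight_from_snr(snr_db)
    fused = fuse_log_likelihoods(audio_ll, video_ll, weight)
    return max(fused, key=fused.get)
```

With this mapping, clean audio lets the acoustic stream dominate the decision, while at low SNR the decision shifts towards the visual stream.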
“…In contrast to the fusion of previous independent processing of each modality [1], the integration could occur at the feature level. In this case audio and video features are concatenated into larger feature-vectors, which are then processed by a single algorithm.…”
Section: Choosing a Fusion Space (mentioning)
confidence: 99%
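
The feature-level integration described in the statement above can be sketched as follows. This is an illustrative example only: it assumes the audio and video feature sequences have already been brought to the same frame rate, and the function name `concatenate_features` is hypothetical.

```python
def concatenate_features(audio_frames, video_frames):
    """Join per-frame audio and video feature vectors into larger vectors.

    Both inputs are lists of equal length, one feature list per frame.
    Assumes the two streams are already synchronized frame by frame.
    """
    if len(audio_frames) != len(video_frames):
        raise ValueError("streams must have the same number of frames")
    return [audio + video for audio, video in zip(audio_frames, video_frames)]
```

The resulting vectors would then be passed to a single recognizer, in contrast to processing each modality independently and combining the results afterwards.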
“…Multiple data streams may be from different sensory modalities, e.g. video and audio [2], or from different representations of the same input stream, such as analysis on different time scales [3], or static and time difference features as used in this paper. We are working with the full-combination multi-stream (FCMS) HMM/ANN approach for noise robust ASR, whose superiority was shown in [3].…”
Section: Introduction (mentioning)
confidence: 99%