1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings
DOI: 10.1109/icassp.1996.543247
Integrating audio and visual information to provide highly robust speech recognition

Cited by 84 publications (72 citation statements)
References 5 publications
“…Furthermore, temporal information between the channels is lost in this approach. AVSR systems based on EI models have, for example, been described in [6,32] and systems based on LI models in [27,30]. Although it is still not well known how humans integrate different modalities, it is generally agreed that integration occurs before speech is categorised phonetically [5,31].…”

Section: Audio-visual Sensor Integration
confidence: 99%
“…In audio-visual speech recognition, there are mainly three integration methods: early integration [1], which concatenates the audio feature vector with the visual feature vector; late integration [2], which weights the likelihoods of results obtained by separate audio and visual processes; and synthetic integration [3], which calculates the product of the output probabilities in each state, and so on. Research on lip-reading using only the visual feature is actively pursued because the visual feature, like the audio feature, greatly influences the recognition rate in this processing.…”

Section: Introduction
confidence: 99%
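The three integration schemes named in the excerpt above can be sketched in a few lines. This is a minimal illustrative sketch, not the cited papers' implementations: the feature dimensions, the four-class score vectors, and the weight `lam` are all assumptions introduced here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-frame feature vectors (dimensions are arbitrary assumptions).
audio_feat = rng.normal(size=12)   # e.g. acoustic features such as MFCCs
visual_feat = rng.normal(size=6)   # e.g. lip-shape parameters

# 1. Early integration: concatenate the feature vectors before recognition.
early_feat = np.concatenate([audio_feat, visual_feat])   # shape (18,)

# 2. Late integration: run separate recognisers, then weight their
#    per-class log-likelihoods (the weight would be tuned, e.g. by SNR).
log_lik_audio = rng.normal(size=4)    # scores for 4 hypothetical classes
log_lik_visual = rng.normal(size=4)
lam = 0.7
late_score = lam * log_lik_audio + (1.0 - lam) * log_lik_visual

# 3. "Synthetic" (product) integration: multiply per-state output
#    probabilities from the audio and visual streams.
p_audio = np.exp(log_lik_audio) / np.exp(log_lik_audio).sum()
p_visual = np.exp(log_lik_visual) / np.exp(log_lik_visual).sum()
product_score = p_audio * p_visual

best_class = int(np.argmax(late_score))
```

Early integration preserves inter-channel timing but requires synchronised streams; late integration lets each modality's weight reflect its reliability, which is why it is often preferred at low acoustic SNR.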
“…Using combined audio and visual features, recognition performance was improved by a maximum of 10% at high and low SNRs over an audio-only recogniser. Future work will focus on finding more effective ways of combining the audio and visual information, with the aim of ensuring that the combined performance is always at least as good as the performance using either modality [1,14,16,17], and on deriving more discriminative features from the scale histogram.…”

Section: Discussion
confidence: 99%
“…It has already been shown [1,6,8,10,13,15,16,17] that the incorporation of visual information with acoustic speech recognition leads to a more robust recogniser. While the visual cues of speech alone are unable to discriminate between all phonemes (e.g.…”
Section: Introduction
confidence: 99%