Integrating audio and visual information to provide highly robust speech recognition

Tomlinson, M.; Russell, Martin J.; Brooke, Nicola

doi:10.1109/icassp.1996.543247

Cited by 84 publications

(72 citation statements)

References 5 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Furthermore, temporal information between the channels is lost in this approach. AVSR systems based on EI models have for example been described in 6,32] and systems based on LI models in 27,30]. Although it is still not well known how h umans integrate di erent modalities, it is generally agreed that integration occurs before speech is categorised phonetically 5, 3 1 ].…”

Section: Audio-visual Sensor Integrationmentioning

confidence: 99%

Continuous audio-visual speech recognition

Luettin

Dupont

1998

Lecture Notes in Computer Science

View full text Add to dashboard Cite

Abstract. We address the problem of robust lip tracking, visual speech feature extraction, and sensor integration for audio-visual speech recognition applications. An appearance based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We t a c kle the problem of joint temporal modelling of the acoustic and visual speech signals by applying Multi-Stream hidden Markov models. This approach allows the use of di erent temporal topologies and levels of stream integration and hence enables to model temporal dependencies more accurately. T h e system has been evaluated for a continuously spoken digit recognition task of 37 subjects. IDIAP{RR 98-02

show abstract

Section: Audio-visual Sensor Integrationmentioning

confidence: 99%

Continuous audio-visual speech recognition

Luettin

Dupont

1998

Lecture Notes in Computer Science

View full text Add to dashboard Cite

show abstract

“…In audio-visual speech recognition, there are mainly three integration methods; early integration [1] that connects the audio feature vector with the visual feature vector, late integration [2] that weights the likelihood of the result obtained by a separate process for audio and visual signals, and synthetic integration [3] that calculates product of output probability in each state and so on. The research to lip-reading only in the visual feature is actively advanced because the visual feature, of course the audio feature, greatly influences the recognition rate in these processing.…”

Section: Introductionmentioning

confidence: 99%

Audio-Visual Speech Recognition Based on AAM Parameter and Phoneme Analysis of Visual Feature

Komai

Ariki

Takiguchi

2011

Advances in Image and Video Technology

View full text Add to dashboard Cite

Abstract. As one of the techniques for robust speech recognition under noisy environment, audio-visual speech recognition using lip dynamic visual information together with audio information is attracting attention and the research is advanced in recent years. Since visual information plays a great role in audio-visual speech recognition, what to select as the visual feature becomes a significant point. This paper proposes, for spoken word recognition, to utilize c combined parameter(combined parameter) as the visual feature extracted by Active Appearance Model applied to a face image including the lip area. Combined parameter contains information of the coordinate value and the intensity value as the visual feature. The recognition rate was improved by the proposed feature compared to the conventional features such as DCT and the principal component score. Finally, we integrated the phoneme score from audio information and the viseme score from visual information with high accuracy.

show abstract

“…Using combined audio and visual features, recognition performance was improved by a maximum of 10% at high and low SNR's over an audio-only recogniser. Future work will focus on finding more effective ways of combining the audio and visual information with the aim of ensuring that the combined performance is always at least as good as the performance using either modality [1,14,16,17] and in deriving more discriminative features from the scale histogram.…”

Section: Discussionmentioning

confidence: 99%

“…It has already been shown [1,6,8,10,13,15,16,17] that the incorporation of visual information with acoustic speech recognition leads to a more robust recogniser. While the visual cues of speech alone are unable to discriminate between all phonemes (e.g.…”

Section: Introductionmentioning

confidence: 99%

Audiovisual speech recognition using multiscale nonlinear image decomposition

Matthews

Bangham²,

Cox³

Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96

View full text Add to dashboard Cite

There has recently been increasing interest in the idea of enhancing speech recognition by the use of visual information derived from the face of the talker. This paper demonstrates the use of nonlinear image decomposition, in the form of a 'sieve', applied to the task of visual speech recognition. Information derived from the mouth region is used in visual and audiovisual speech recognition of a database of the letters A-Z for four talkers. A scale histogram is generated directly from the grayscale pixels of a window containing the talkers mouth on a per frame basis. Results are presented for visual-only, audio-only and in a simple audiovisual case.

show abstract

Integrating audio and visual information to provide highly robust speech recognition

Cited by 84 publications

References 5 publications

Continuous audio-visual speech recognition

Continuous audio-visual speech recognition

Audio-Visual Speech Recognition Based on AAM Parameter and Phoneme Analysis of Visual Feature

Audiovisual speech recognition using multiscale nonlinear image decomposition

Contact Info

Product

Resources

About