1995
DOI: 10.1007/bf00849043
A comparison of models for fusion of the auditory and visual sensors in speech perception

Cited by 25 publications (9 citation statements)
References 56 publications
“…Studies of cross-modal integration have focused both on the effect of congruency, when the audio and video stimuli are taken from the same or different utterances (McGurk and MacDonald 1976; Summerfield and McGrath 1984), the same or different speakers (Kamachi et al 2003), and also on the role of temporal synchrony, when the audio and video stimuli are congruent but temporally misaligned (Dixon and Spitz 1980; McGrath and Summerfield 1985). All of these studies have found consistent improvements in perception when visual information is used to supplement auditory information, even when the information in the video signal is imperfect or even partially inconsistent with the audio signal (Summerfield 1987; Robert-Ribes et al 1995). …”
mentioning (confidence: 94%)
“…Findings from other areas of research have suggested the existence of a mechanism or representation common to the processing of speech input from the auditory and visual modalities (Campbell, 1987;Watson, Qiu, Chamberlain, & Li, 1996). Cross-modal interaction at some level has demonstrated that information from different sensory modalities can be combined in perception, as in the McGurk effect (e.g., McGurk & MacDonald, 1976), and that input to one modality can influence processing in another (see, e.g., Robert-Ribes, Schwartz, & Escudier, 1995, for a review). Using magnetoencephalographic recordings, visual input specifically from lip movements was found to influence auditory cortical activity (Sams, Aulanko, Hämäläinen, Hari, Lounasmaa, Lu, & Simola, 1991).…”
mentioning (confidence: 99%)
“…As reported in [29], both vocal intonations and facial expressions determine the listener's affective state in up to 93% of cases. Recently, increased attention has been paid to analyzing multimodal information in emotion recognition (e.g., [1,7,[9][10][11][12][13][30][31][32][33][34]). However, most of them still use deliberate and often exaggerated facial displays (e.g., [2,5]).…”
Section: Introduction
mentioning (confidence: 99%)