2018
DOI: 10.1163/22134808-00002560
The Time Course of Audio-Visual Phoneme Identification: a High Temporal Resolution Study

Abstract: International audience

Cited by 12 publications (10 citation statements) | References 58 publications
“…In other words, when presented visually with a facial emotional expression, we need a shorter exposure to the stimulus compared to the matching vocal expression in order to reach the same efficient discrimination. In the field of word recognition, where the gating task has been more widely exploited, it has been shown that when discriminating a set of words that only differ by one phoneme, the performance can be better either in the visual or in the auditory domain depending on the saliency of the modality for each specific phoneme, and multisensory integration does not necessarily lead to a more successful discrimination (Sánchez-García et al 2018). Instead, in the context of emotion expressions' discrimination, our results robustly show that a multimodal context is always advantageous, and a discriminatory decision is reached earlier than in either unisensory condition.…”
Section: Discussion (mentioning)
confidence: 99%
“…Indeed, emotion expression from the face and voice are intrinsically time-embedded and the accumulation of sensory evidence allowing for a reliable decision about which emotion is being displayed may vary across the senses and across emotions. The present study has therefore been designed with the scope of evaluating how observers accumulate informational evidence at different time points during the unfolding of dynamic visual, auditory and bimodal emotional signals, using a gating paradigm (Grosjean 1980; Jesse and Massaro 2010; Sánchez-García et al 2018). Building on existing cognitive models of speech perception (Marslen-Wilson and Welsh 1978; Marslen-Wilson 1987; Davis et al 2002), we assume that when a perceiver performs the extraction of emotional information from the face and/or voice of an interlocutor, discrete stored properties of each emotion matching the incoming expression are rapidly and partially activated, similarly to the activation of the "word initial cohort" when hearing incoming speech (Marslen-Wilson 1987).…”
Section: Introduction (mentioning)
confidence: 99%
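To make the gating paradigm referenced in the excerpts above concrete, here is a minimal sketch of how an "isolation point" (the first gate from which identification stays correct) might be scored. The gate duration, response data, and function name are assumptions for illustration only; none of this is taken from the cited studies.

```python
# Hypothetical sketch of scoring a gating task: stimuli are truncated at
# successively longer gates, the participant identifies the target at each
# gate, and the isolation point is the first gate from which responses stay
# correct. Gate increment and responses below are invented for illustration.

GATE_MS = 40  # assumed gate increment, not taken from the cited studies


def isolation_point(responses, target):
    """Return (gate index, gate offset in ms) of the first gate from which
    every response matches the target; None if identification never stabilizes."""
    for gate in range(len(responses)):
        if all(r == target for r in responses[gate:]):
            return gate, (gate + 1) * GATE_MS
    return None


# One simulated trial: responses at gates 1-8 for a target /b/.
trial = ["d", "d", "b", "d", "b", "b", "b", "b"]
print(isolation_point(trial, "b"))  # -> (4, 200): stable from the fifth gate on
```

A score of this kind, averaged over trials, is what would allow auditory-only, visual-only and audiovisual presentations to be compared, as in the excerpt above.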
“…While the CIMS model is agnostic with respect to neuroanatomy, BOLD fMRI and modeling suggest that there are anatomical dissociations between brain areas responsible for sensory encoding and those responsible for causal inference judgments (Rohe and Noppeney, 2015; Cuppini et al., 2017), making it reasonable that individual differences in one computation are uncoupled from individual differences in the other. Findings that the McGurk effect shows different neural signatures than congruent audiovisual syllables (Erickson et al., 2014; Moris Fernandez et al., 2017; Sánchez-García et al., 2018) have been used as evidence that the McGurk effect is processed differently than everyday speech. The CIMS model clarifies that…”
Section: Relating the McGurk Effect to Other Speech Perception Tasks (mentioning)
confidence: 99%
“…The conceptual model explains the absence of multisensory benefit for voice-leading speech because of the lack of a perceptual head start provided by visual speech, suggesting a number of interesting experiments. Voice-leading speech could be transformed by experimentally manipulating auditory-visual asynchrony, advancing the visual portion of the recording and rendering it effectively "mouth-leading" (Magnotti et al., 2013; Sánchez-García et al., 2018).…”
Section: Model Predictions and Summary (mentioning)
confidence: 99%
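To make the proposed asynchrony manipulation concrete, here is a minimal sketch, with invented onset times, of how far the visual track of a voice-leading recording would need to be advanced to yield a given visual lead. The function name, onsets, and the 100 ms target lead are assumptions for illustration only.

```python
# Hypothetical sketch of the manipulation described above: for a "voice-leading"
# token (audible speech starts before visible mouth movement), shift the visual
# track earlier so it becomes effectively "mouth-leading".


def visual_shift_ms(audio_onset_ms, visual_onset_ms, target_visual_lead_ms):
    """Shift (in ms, negative = play earlier) to apply to the visual track so
    that visible articulation begins target_visual_lead_ms before the audio."""
    return (audio_onset_ms - target_visual_lead_ms) - visual_onset_ms


# Voice-leading token: audio onset at 500 ms, mouth movement at 560 ms.
print(visual_shift_ms(500, 560, target_visual_lead_ms=100))  # -> -160
```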