Speech perception is a central component of social communication. While speech perception is primarily driven by sound, accurate perception in everyday settings is also supported by meaningful information extracted from visual cues (e.g., speech content, timing, and speaker identity). Previous research has shown that visual speech modulates activity in cortical areas subserving auditory speech perception, including the superior temporal gyrus (STG), likely through feedback connections from the multisensory posterior superior temporal sulcus (pSTS). However, it is unknown whether visual modulation of auditory processing in the STG is a unitary phenomenon or, rather, consists of multiple temporally, spatially, or functionally discrete processes. To explore these questions, we examined neural responses to audiovisual speech recorded from electrodes implanted in the temporal cortex of 21 patients undergoing clinical monitoring for epilepsy. We found that visual speech modulates auditory processes in the STG in multiple ways, eliciting temporally and spatially distinct patterns of activity that differ across the theta, beta, and high-gamma frequency bands. Before speech onset, visual information increased high-gamma power in the posterior STG and suppressed beta power in mid-STG regions, suggesting crossmodal prediction of speech signals in these areas. After sound onset, visual speech decreased theta power in the middle and posterior STG, potentially reflecting a decrease in sustained feedforward auditory activity. These results are consistent with models that posit multiple distinct mechanisms supporting audiovisual speech perception.
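To make the band-specific analysis concrete, the minimal sketch below shows one common way to estimate theta, beta, and high-gamma power from an intracranial voltage recording. This is an illustration, not the authors' actual pipeline: the 1 kHz sampling rate, band edges, filter order, and the channels-by-samples array layout are all assumptions.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

FS = 1000.0  # assumed sampling rate (Hz) of the intracranial recording
BANDS = {"theta": (4, 8), "beta": (13, 30), "high_gamma": (70, 150)}

def band_power(signal, low, high, fs=FS, order=4):
    """Instantaneous power of `signal` band-passed between `low` and `high` Hz."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, signal, axis=-1)      # zero-phase band-pass filter
    envelope = np.abs(hilbert(filtered, axis=-1))   # analytic amplitude (Hilbert)
    return envelope ** 2                            # power envelope over time

# Usage with a hypothetical (channels x samples) array `ecog`:
# theta_power = band_power(ecog, *BANDS["theta"])

Pre- versus post-onset effects of the kind described above would then be assessed by comparing these power envelopes across audiovisual and auditory-only trials in windows aligned to speech onset.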
Face-to-face communication improves the quality and accuracy of heard speech, particularly in noisy environments. Silent lipreading modulates activity in auditory regions, which has been hypothesized to reflect the transformation and encoding of multiple forms of visual speech information used to support hearing. Evidence suggests that visual timing information is one such signal encoded in auditory areas: seeing when a speaker's lips come together between words can help listeners parse word-level boundaries. However, it remains unclear how lipreading alters activity in the auditory system to improve speech perception at the single-word level. Using fMRI and intracranial electrodes in patients, here we show that silently lipread words can be classified from neural activity in auditory areas based on distributed spatial information. Lipread words evoked representations similar to those of the corresponding heard words, consistent with the prediction that automatic lipreading refines the tuning of auditory representations. As with heard words, lipread words varied in the distinctiveness of their neural representations in auditory cortex: for example, the lipread words DIG and GIG evoked more similar neural activity in auditory cortex than did the more perceptually distinct word FIG, suggesting that lipreading activity reflects probabilistic distributions rather than the unique identity of the lipread word. Notably, while visual speech has both excitatory and suppressive effects on auditory firing rates, classification was observed in both neural populations, consistent with the prediction that lipreading contributes to phoneme population tuning both by activating the corresponding representation and by suppressing incorrect phonemic representations. These results support a model in which the auditory system combines the joint neural distributions evoked by heard and lipread words to generate a more precise estimate of what was said, particularly during noisy speech.
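As a purely illustrative sketch of the kind of decoding described above (not the authors' reported method), word identity could be classified from distributed spatial patterns with a cross-validated linear model. The feature matrix X (trials by electrodes or voxels), the labels y, and every parameter choice here are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def decode_words(X, y, n_splits=5):
    """Cross-validated accuracy for classifying word labels from spatial patterns.

    X: (n_trials, n_features) activity per electrode or voxel; y: word labels.
    """
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    return cross_val_score(clf, X, y, cv=cv).mean()

# A confusion matrix over held-out trials would be expected to show more
# errors between visually similar words (e.g., DIG vs. GIG) than between
# perceptually distinct ones (e.g., DIG vs. FIG).

Under this framing, graded classifier confusions between similar lipread words are the expected signature of probabilistic, rather than uniquely identifying, representations.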