Visual speech facilitates auditory speech perception, but the visual cues responsible for these effects and the crossmodal information they provide remain unclear. Because visible articulators shape the spectral content of auditory speech, we hypothesized that listeners may be able to extract spectrotemporal information from visual speech to facilitate auditory speech perception. To uncover statistical regularities that could subserve such facilitations, we compared the resonant frequency of the oral cavity to the shape of the oral aperture during speech. We found that the time-frequency dynamics of oral resonances could be recovered with unexpectedly high precision from the shape of the mouth during speech. Because both auditory frequency modulations and visual shape properties are neurally encoded as mid-level perceptual features, we hypothesized that this feature-level correspondence would allow for spectrotemporal information to be recovered from visual speech without reference to higher order (e.g., phonemic) speech representations. Isolating these features from other speech cues, we found that speech-based shape deformations improved sensitivity for corresponding frequency modulations, suggesting that the perceptual system exploits crossmodal correlations in mid-level feature representations to enhance speech perception. To test whether this correspondence could be used to improve comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by crossmodal recovery of auditory speech spectra. Visual speech may therefore facilitate perception by crossmodally restoring degraded spectrotemporal signals in speech.
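To make the degradation manipulation concrete, the following is a minimal illustrative sketch (not the authors' actual pipeline, which the text does not specify) of how the spectral versus temporal dimension of a speech spectrogram could be selectively degraded, here by Gaussian smoothing along the frequency axis or the time axis. The function name, window parameters, and smoothing severity are all hypothetical choices for illustration only.

```python
# Illustrative sketch: selectively degrade the spectral vs. temporal
# dimension of a magnitude spectrogram by blurring along one axis.
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import gaussian_filter1d

def degrade_spectrogram(audio, fs, mode="spectral", sigma=8):
    """Compute a magnitude spectrogram and blur it along one dimension.

    mode="spectral": smear detail across frequency (axis 0).
    mode="temporal": smear detail across time (axis 1).
    `sigma` (in frequency bins or time frames) controls degradation
    severity; the default here is arbitrary.
    """
    f, t, sxx = spectrogram(audio, fs=fs, nperseg=512, noverlap=384)
    axis = 0 if mode == "spectral" else 1
    degraded = gaussian_filter1d(sxx, sigma=sigma, axis=axis)
    return f, t, degraded

# Example: degrade a 1-second noise "utterance" along each dimension.
fs = 16000
audio = np.random.randn(fs)
_, _, spec_degraded = degrade_spectrogram(audio, fs, mode="spectral")
_, _, temp_degraded = degrade_spectrogram(audio, fs, mode="temporal")
```

Under this sketch, spectral degradation blurs frequency detail while preserving the temporal envelope, and temporal degradation does the reverse, which is the contrast the comprehension experiment described above is designed to exploit.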
Speech perception is a central component of social communication. While principally an auditory process, accurate speech perception in everyday settings is supported by meaningful information extracted from visual cues (e.g., speech content, timing, and speaker identity). Auditory speech signals are conveyed rapidly during natural speech (3-7 syllables per second; Chandrasekaran et al., 2009), making the identification of individual speech sounds a computationally challenging task (Elliott and Theunissen, 2009). Easing the complexity of this process, audiovisual signals during face-to-face communication help predict and constrain perceptual inferences about speech sounds in both a bottom-up and top-down manner (Bernstein