Four experiments examined the nature of multisensory speech information. In Experiment 1, participants were asked to match heard voices with dynamic visual-alone video clips of speakers' articulating faces. This cross-modal matching task was used to examine whether vocal source matching can be accomplished across sensory modalities. The results showed that observers could match speaking faces and voices, indicating that information about the speaker was available for cross-modal comparisons. In a series of follow-up experiments, several stimulus manipulations were used to determine some of the critical acoustic and optic patterns necessary for specifying cross-modal source information. The results showed that cross-modal source information was not available in static visual displays of faces and was not contingent on a prominent acoustic cue to vocal identity (f0). Furthermore, cross-modal matching was not possible when the acoustic signal was temporally reversed.
The present study examined how postlingually deafened adults with cochlear implants combine visual information from lipreading with auditory cues in an open-set word recognition task. Adults with normal hearing served as a comparison group. Word recognition performance was assessed using lexically controlled word lists presented under auditory-only, visual-only, and combined audiovisual presentation formats. Effects of talker variability were studied by manipulating the number of talkers producing the stimulus tokens. Lexical competition was investigated using sets of lexically easy and lexically hard test words. To assess the degree of audiovisual integration, a measure of visual enhancement (Ra) was used to quantify the gain in performance in the audiovisual presentation format relative to the maximum possible performance obtainable in the auditory-only format. Results showed that word recognition performance was highest for audiovisual presentation, followed by auditory-only and then visual-only presentation. Performance was better for single-talker lists than for multiple-talker lists, particularly under the audiovisual presentation format. Word recognition performance was better for the lexically easy than for the lexically hard words regardless of presentation format. Visual enhancement scores were higher for single-talker conditions than for multiple-talker conditions and tended to be somewhat better for lexically easy words than for lexically hard words. The pattern of results suggests that information from the auditory and visual modalities is used to access common, multimodal lexical representations in memory.
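The visual enhancement measure described above is conventionally computed as the audiovisual gain scaled by the headroom remaining above auditory-only performance, i.e. Ra = (AV − A) / (1 − A) for proportion-correct scores. The abstract does not give the formula explicitly, so the sketch below assumes this standard relative-gain formulation; the function name is illustrative.

```python
def visual_enhancement(auditory_only: float, audiovisual: float) -> float:
    """Relative visual enhancement Ra = (AV - A) / (1 - A).

    Both arguments are proportion-correct scores in [0, 1]. The
    denominator (1 - A) is the maximum possible improvement over
    auditory-only performance, so Ra expresses the audiovisual gain
    as a fraction of the available headroom.
    """
    if not 0.0 <= auditory_only < 1.0:
        raise ValueError("auditory-only score must be in [0, 1)")
    if not 0.0 <= audiovisual <= 1.0:
        raise ValueError("audiovisual score must be in [0, 1]")
    return (audiovisual - auditory_only) / (1.0 - auditory_only)


# Example: 50% correct auditory-only, 75% correct audiovisual
# -> half of the possible improvement was realized (Ra = 0.5).
print(visual_enhancement(0.50, 0.75))
```

Because Ra normalizes by headroom, a listener starting at 90% correct who reaches 95% audiovisually gets the same Ra (0.5) as one who improves from 50% to 75%, which is why the measure is preferred over raw difference scores when baseline auditory performance varies across participants.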
The findings are discussed in terms of the complementary nature of auditory and visual sources of information that specify the same underlying gestures and articulatory events in speech.

Keywords: cochlear implants; hearing impairment; speech perception; audiovisual

Cochlear implants (CIs) are electronic auditory prostheses for individuals with severe-to-profound hearing impairment that enable many of them to perceive and understand spoken language. However, the benefit to an individual user varies greatly. Auditory-alone performance measures have demonstrated that some CI users are able to communicate successfully over a telephone even when lipreading cues are unavailable (e.g., Dorman, Dankowski, McCandless, Parkin, & Smith, 1991). Other CI users display little benefit in open-set speech perception tests under auditory-alone listening conditions, but find that the CI helps them understand speech when visual information also is available. One source of these individual differences is undoubtedly the way in which the surviving neural elements in the cochlea are stimulated with electrical currents provided by the speech processor (Fryauf-Bertschy, Tyler, Kelsay, Gantz, & Woodworth, 1997). Other sources of variability may result from the way in which these initial sensory inputs are coded and processed…
The relationships observed between auditory-alone speech perception, audiovisual benefit, and speech intelligibility indicate that these abilities are not based on independent language skills, but instead reflect a common source of linguistic knowledge, used in both perception and production, that is based on the dynamic, articulatory motions of the vocal tract. The effects of communication mode demonstrate the important contribution of early sensory experience to perceptual development, specifically, language acquisition and the use of phonological processing skills. Intervention and treatment programs that aim to increase receptive and productive spoken language skills, therefore, may wish to emphasize the inherent cross-correlations that exist between auditory and visual sources of information in speech perception.
Information about the acoustic properties of a talker's voice is available in optical displays of speech, and vice versa, as evidenced by perceivers' ability to match faces and voices based on vocal identity. The present investigation used point-light displays (PLDs) of visual speech and sinewave replicas of auditory speech in a cross-modal matching task to assess perceivers' ability to match faces and voices under conditions when only isolated kinematic information about vocal tract articulation was available. These stimuli were also used in a word recognition experiment under auditory-alone and audiovisual conditions. The results showed that isolated kinematic displays provide enough information to match the source of an utterance across sensory modalities. Furthermore, isolated kinematic displays can be integrated to yield better word recognition performance under audiovisual conditions than under auditory-alone conditions. The results are discussed in terms of their implications for describing the nature of speech information and current theories of speech perception and spoken word recognition.
In a cross-modal matching task, participants were asked to match visual and auditory displays of speech based on the identity of the speaker. The present investigation used this task with acoustically transformed speech to examine the properties of sound that can convey cross-modal information. Word recognition performance was also measured under the same transformations. The authors found that cross-modal matching was only possible under transformations that preserved the relative spectral and temporal patterns of formant frequencies. In addition, cross-modal matching was only possible under the same conditions that yielded robust word recognition performance. The results are consistent with the hypothesis that acoustic and optical displays of speech simultaneously carry articulatory information about both the underlying linguistic message and indexical properties of the talker.