Synchronous presentation of stimuli to the auditory and visual systems can modify the formation of a percept in either modality. For example, perception of auditory speech is improved when the speaker's facial articulatory movements are visible. Neural convergence onto multisensory sites exhibiting supra-additivity has been proposed as the principal mechanism for integration. Recent findings, however, have suggested that putative sensory-specific cortices are responsive to inputs presented through a different modality. Consequently, when and where audiovisual representations emerge remains unsettled. In combined psychophysical and electroencephalography experiments we show that visual speech speeds up the cortical processing of auditory signals early (within 100 ms of signal onset). The auditory-visual interaction is reflected as an articulator-specific temporal facilitation (as well as a nonspecific amplitude reduction). The latency facilitation systematically depends on the degree to which the visual signal predicts possible auditory targets. The observed auditory-visual data support the view that there exist abstract internal representations that constrain the analysis of subsequent speech inputs. This is evidence for the existence of an "analysis-by-synthesis" mechanism in auditory-visual speech perception.

Some incongruent AV pairings yield "combination" percepts such as "pk" or "kp" but never a fused percept. These results illustrate the effect of input modality on the perceptual outcome of AV speech and suggest that multisensory percept formation depends systematically on the informational content of the inputs. In classic speech theories, however, visual speech has seldom been treated as a natural source of speech input. Ultimately, when in the processing stream (i.e., at which representational stage) sensory-specific information fuses to yield unified percepts is fundamental for any theoretical, computational, and neuroscientific account of speech perception.

Recent investigations of AV speech have relied on hemodynamic methods, which cannot speak directly to timing issues (2, 3). Electroencephalographic (EEG) and magnetoencephalographic (MEG) studies (4-7) testing AV speech integration have typically used oddball or mismatch-negativity paradigms; thus, the earliest AV speech interactions have been reported for the 150- to 250-ms mismatch response. Whether systematic AV speech interactions can be documented earlier is controversial, although nonspeech effects can be observed early (8).
AV Speech as a Multisensory Problem

Several properties of speech are relevant to the present study. (i) Because AV speech is ecologically valid for humans (9, 10), one might predict an involvement of specialized neural computations capable of handling the spectrotemporal complexity of AV speech (compared with, say, arbitrary tone-flash pairings, for which no natural functional relevance can be assumed). (ii) Natural AV speech is characterized by particular dynamics, such as (a) the temporal precedence of visual speech (the movement of the facial articulators typically ...