According to predictive coding theory, top-down predictions are conveyed by backward connections and prediction errors are propagated forward across the cortical hierarchy. Using MEG in humans, we show that violating multisensory predictions causes a fundamental, qualitative change in both the frequency and the spatial distribution of cortical activity. When visual speech input correctly predicted auditory speech signals, a slow delta regime (3-4 Hz) developed in higher-order speech areas. In contrast, when auditory signals invalidated predictions inferred from vision, a low-beta (14-15 Hz) / high-gamma (60-80 Hz) coupling regime appeared locally in a multisensory area, the superior temporal sulcus (STS). This frequency shift in oscillatory responses scaled with the degree of audio-visual congruence and was accompanied by increased gamma activity in lower sensory regions. These findings are consistent with the notion that bottom-up prediction errors are communicated predominantly in high (gamma) frequency ranges, whereas top-down predictions are mediated by slower (beta) frequencies.

© 2011 Nature America, Inc. All rights reserved. Nature Neuroscience, Volume 14, Number 6, June 2011.

… converge (that is, where multisensory predictions are generated), whereas gamma activity was seen in lower sensory cortices, where prediction errors emerge and are propagated forward.
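The low-beta/high-gamma coupling regime described above is commonly quantified as phase-amplitude coupling. As a minimal illustrative sketch (not the authors' analysis pipeline), the snippet below estimates coupling between low-beta phase and gamma amplitude with a mean-vector-length metric on a synthetic signal; the filter bands, filter design, and the widened amplitude band in the demo (50-90 Hz, so that the 70 ± 14.5 Hz modulation sidebands survive filtering) are assumptions for the example, not values from the study.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def bandpass(x, lo, hi, fs, order=4):
    """Zero-phase Butterworth band-pass filter."""
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def pac_mvl(x, fs, phase_band=(14, 15), amp_band=(60, 80)):
    """Mean-vector-length phase-amplitude coupling:
    |mean(A_gamma(t) * exp(i * phi_beta(t)))|."""
    phi = np.angle(hilbert(bandpass(x, *phase_band, fs)))
    amp = np.abs(hilbert(bandpass(x, *amp_band, fs)))
    return np.abs(np.mean(amp * np.exp(1j * phi)))

# Synthetic check: gamma amplitude locked to low-beta phase vs. not.
np.random.seed(0)
fs = 600.0
t = np.arange(0, 10, 1 / fs)
beta = np.sin(2 * np.pi * 14.5 * t)
noise = 0.1 * np.random.randn(t.size)
coupled = beta + 0.5 * (1 + beta) * np.sin(2 * np.pi * 70 * t) + noise
uncoupled = beta + 0.5 * np.sin(2 * np.pi * 70 * t) + noise
# Amplitude band widened to (50, 90) so the modulation sidebands pass.
mvl_coupled = pac_mvl(coupled, fs, amp_band=(50, 90))
mvl_uncoupled = pac_mvl(uncoupled, fs, amp_band=(50, 90))
print(mvl_coupled > mvl_uncoupled)  # → True
```

Only the signal whose gamma envelope is modulated by the beta cycle yields an appreciable mean vector length; the unmodulated control stays near zero.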
RESULTS

We presented 15 subjects with stimuli in one of three conditions: videos (audio-visual: AV condition) of a speaker pronouncing the syllables /pa/, / a/, /la/, /ta/, /ga/ and /fa/ (International Phonetic Alphabet notation); the auditory track of these videos combined with a still face (auditory: A condition); or the mute visual track (visual: V condition). The videos could be either natural or a random combination of auditory and visual tracks, yielding conditions in which the auditory and visual tracks were congruent (AVc condition) and ones in which they were incongruent (AVi condition; see Online Methods and Supplementary Fig. 1). Incongruent combinations yielding illusory fusion percepts (that is, McGurk stimuli) were excluded (refs. 8,11).

Subjects performed an unrelated target detection task on the syllable /fa/, which was presented in A, V or AVc form in 13% of the trials (97% correct detection). These trials were subsequently excluded from the analyses.

The five other syllables were chosen because they yielded graded recognition accuracy when presented visually (Fig. 1a), reflecting their increasing predictiveness (ref. 10). The phonological prediction conveyed by mouth movements (visemes) varies in specificity depending on the pronounced syllable. Typically, syllables beginning with a consonant formed at the front of the mouth (/p/, /m/) convey a more specific prediction than those formed at the back (/g/, /k/, /r/, /l/) (ref. 8).

Our second experimental factor pertained to the validity of the visual prediction with respect to the auditory input. Physically, the audio-visual stimuli could be either congruent (valid prediction) or incongruent (invalid prediction), w...
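The design above (A, V, AVc and AVi conditions; /fa/ targets on 13% of trials, presented only in A, V or AVc form) can be sketched as a trial-list generator. This is a hypothetical illustration: the syllable subset, condition proportions, and block size are assumptions, with only the 13% target rate and the condition definitions taken from the text.

```python
import random

random.seed(1)
SYLLABLES = ["pa", "la", "ta", "ga"]  # illustrative subset; /fa/ is the target

def make_trial(condition):
    """Return (condition, auditory_track, visual_track) for one non-target trial."""
    syl = random.choice(SYLLABLES)
    if condition == "A":                     # auditory track + still face
        return ("A", syl, None)
    if condition == "V":                     # mute visual track
        return ("V", None, syl)
    if condition == "AVc":                   # natural video: tracks congruent
        return ("AVc", syl, syl)
    # AVi: recombined tracks, so the sound invalidates the visual prediction
    other = random.choice([s for s in SYLLABLES if s != syl])
    return ("AVi", syl, other)

def make_block(n_trials=200, target_rate=0.13):
    """Trial list with /fa/ targets (A, V or AVc form only) on ~13% of trials."""
    trials = []
    for _ in range(n_trials):
        if random.random() < target_rate:    # target trial, never AVi
            cond = random.choice(["A", "V", "AVc"])
            trials.append((cond,
                           "fa" if cond != "V" else None,
                           "fa" if cond != "A" else None))
        else:
            trials.append(make_trial(random.choice(["A", "V", "AVc", "AVi"])))
    return trials
```

As in the study, trials containing /fa/ would later be dropped from analysis, and any recombination inducing a McGurk fusion percept would additionally be excluded (not modeled here).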