Human face-to-face communication is essentially audiovisual. Typically, people talk to us face-to-face, providing concurrent auditory and visual input. Understanding someone is easier when there is visual input, because visual cues such as mouth and tongue movements provide complementary information about speech content. Here, we hypothesized that, even in the absence of visual input, the brain optimizes both auditory-only speech and speaker recognition by harvesting speaker-specific predictions and constraints from distinct visual face-processing areas. To test this hypothesis, we performed behavioral and neuroimaging experiments in two groups: subjects with a face recognition deficit (prosopagnosia) and matched controls. The results show that observing a specific person talking for 2 min improves subsequent auditory-only speech and speaker recognition for that person. In both prosopagnosics and controls, behavioral improvement in auditory-only speech recognition was based on an area typically involved in face-movement processing. Improvement in speaker recognition was present only in controls and was based on an area involved in face-identity processing. These findings challenge current unisensory models of speech processing, because they show that, in auditory-only speech, the brain exploits previously encoded audiovisual correlations to optimize communication. We suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face.

fMRI | multisensory | predictive coding | prosopagnosia

Human face-to-face communication works best when one can watch the speaker's face (1). This becomes obvious when someone speaks to us in a noisy environment, in which the auditory speech signal is degraded. Visual cues place constraints on what our brain expects to perceive in the auditory channel. These visual constraints improve the recognition rate for audiovisual speech, compared with auditory speech alone (2). Similarly, speaker identity recognition by voice can be improved by concurrent visual information (3). Accordingly, audiovisual models of human voice and face perception posit that there are interactions between auditory and visual processing streams (Fig. 1A) (4, 5).

Based on prior experimental (6–8) and theoretical work (9–12), we hypothesized that, even in the absence of visual input, the brain optimizes auditory-only speech and speaker recognition by harvesting predictions and constraints from distinct visual face areas (Fig. 1B).

Experimental studies (6, 8) demonstrated that the identification of a speaker by voice is improved after a brief audiovisual experience with that speaker (in contrast to a matched control condition). This improvement was paralleled by an interaction between voice- and face-identity-sensitive areas (8). The finding suggested that the associative representation of a particular face facilitates the recognition of that person by voice. However, it is unclear whether this effect also extends to other audiovisual dependencies in human ...