Supramodal representation of emotion and its neural substrates have recently attracted attention as a marker of social cognition. However, the question whether perceptual integration of facial and vocal emotions takes place in primary sensory areas, multimodal cortices, or in affective structures remains unanswered yet. Using novel computer-generated stimuli, we combined emotional faces and voices in congruent and incongruent ways and assessed functional brain data (fMRI) during an emotional classification task. Both congruent and incongruent audiovisual stimuli evoked larger responses in thalamus and superior temporal regions compared with unimodal conditions. Congruent emotions were characterized by activation in amygdala, insula, ventral posterior cingulate (vPCC), temporo-occipital, and auditory cortices; incongruent emotions activated a frontoparietal network and bilateral caudate nucleus, indicating a greater processing load in working memory and emotion-encoding areas. The vPCC alone exhibited differential reactions to congruency and incongruency for all emotion categories and can thus be considered a central structure for supramodal representation of complex emotional information. Moreover, the left amygdala reflected supramodal representation of happy stimuli. These findings document that emotional information does not merge at the perceptual audiovisual integration level in unimodal or multimodal areas, but in vPCC and amygdala.