Social communication draws on several cognitive functions, such as perception, emotion recognition and attention. In a previous study, we demonstrated that macaques associate audio-visual information when processing their species-specific communicative signals. Specifically, cortical activation is inhibited when there is a mismatch between vocalisations and social visual information, whereas activation is enhanced in the lateral sulcus and superior temporal sulcus, as well as in a larger network composed of early visual and prefrontal areas, when vocalisations and social visual information match. Here, we use a similar task and functional magnetic resonance imaging to assess the role of subcortical structures. We identify three subcortical regions involved in audio-visual processing of species-specific communicative signals: the amygdala, the claustrum and the pulvinar. Like the cortex, these subcortical structures are not activated when there is a mismatch between visual and acoustic information. In contrast, the amygdala and claustrum are activated by visual, congruent auditory and audio-visual stimulation. The pulvinar responds in a task-dependent manner, along a specific spatial sensory gradient: the anterior pulvinar responds to auditory stimuli, the medial pulvinar is activated by auditory, audio-visual and visual stimuli, and the dorsal lateral pulvinar responds only to visual stimuli in a purely visual task. The medial pulvinar and the amygdala are the only subcortical structures that integrate audio-visual social stimuli. We propose that these three structures belong to a multisensory network that modulates the perception of visual socioemotional information and vocalisations as a function of the relevance of the stimuli in the social context.