Social interactions rely on the ability to interpret semantic and emotional information, often from multiple sensory modalities. In human and nonhuman primates, both the auditory and visual modalities are used to generate and interpret communicative signals. In individuals with autism, not only are there deficits in social communication, but in the integration of audio-visual information. At present, we know little about the neural mechanisms that subserve the interpretation of complex social events, including the audio-visual integration that is often required with accompanying communicative signals. Based on heart rate estimates and fMRI in two macaque monkeys (Macaca mulatta), we show that individuals systematically associate affiliative facial expressions or social scenes with corresponding affiliative vocalizations, aggressive facial expressions or social scenes with corresponding aggressive vocalizations and escape visual scenes with scream vocalizations. In contrast, vocalizations that are incompatible with the visual information are fully suppressed, suggesting top-down regulation over the processing of sensory input. The process of binding audio-visual semantic and contextual information relies on a core functional network involving the superior temporal sulcus (STS) and lateral sulcus (LS). Peak activations in both sulci co-localize with face or voice patches that have been previously described. While all of these regions of interest (ROIs) respond to both auditory and visual information, LS ROIs have a preference for auditory and audio-visual congruent stimuli while STS ROIs equally respond to auditory, visual and audio-visual congruent stimuli. To further specify the cortical network involved in the control of this semantic association, we performed a whole brain gPPI functional connectivity analysis on the LS and STS cumulated ROIs. This gPPI analysis highlights a functional network connected to the LS and STS, involving the anterior cingulate cortex (ACC), area 46 in the dorsolateral prefrontal cortex (DLPFC), the orbitofrontal cortex (OFC), the intraparietal sulcus (IPS), the insular cortex and subcortically, the amygdala and the hippocampus. Comparing human and macaque results, we propose that the integration of audio-visual information for congruent, meaningful social events involves homologous neural circuitry, specifically, an emotional network composed of the STS, LS, ACC, OFC, and limbic areas, including the amygdala, and an attentional network including the STS, LS, IPS and DLPFC. As such, these networks are critical to the amodal representation of social meaning, thereby providing an explanation for some of deficits observed in autism.