When social animals communicate, the onset of informative content in one modality varies considerably relative to the other, such as when visual orofacial movements precede a vocalization. These naturally occurring asynchronies do not disrupt intelligibility or perceptual coherence. However, they occur on time scales where they likely affect integrative neuronal activity in ways that have remained unclear, especially for hierarchically downstream regions in which neurons exhibit temporally imprecise but highly selective responses to communication signals. To address this, we exploited naturally occurring face- and voice-onset asynchronies in primate vocalizations. Using these as stimuli, we recorded cortical oscillations and neuronal spiking responses from functional MRI (fMRI)-localized voice-sensitive cortex in the anterior temporal lobe of macaques. We show that the onset of the visual face stimulus resets the phase of low-frequency oscillations, and that the face-voice asynchrony affects the prominence of two key types of neuronal multisensory responses: enhancement or suppression. Our findings show a three-way association between temporal delays in audiovisual communication signals, phase-resetting of ongoing oscillations, and the sign of multisensory responses. The results reveal how natural onset asynchronies in cross-sensory inputs regulate network oscillations and neuronal excitability in the voice-sensitive cortex of macaques, a suggested animal model for human voice areas. These findings also advance predictions on the impact of multisensory input on neuronal processes in face areas and other brain regions.

How the brain parses multisensory input despite the variable and often large differences in the onset of sensory signals across different modalities remains unclear. We can maintain a coherent multisensory percept across a considerable range of spatial and temporal discrepancies (1-4): for example, auditory and visual speech signals can be perceived as belonging to the same multisensory "object" over temporal windows of hundreds of milliseconds (5-7). However, such misalignment can drastically affect neuronal responses in ways that may also differ between brain regions (8-10). We asked how natural asynchronies in the onset of face/voice content in communication signals would affect voice-sensitive cortex, a region in the ventral "object" pathway (11) where neurons (i) are selective for auditory features in communication sounds (12-14), (ii) are influenced by visual "face" content (12), and (iii) display relatively slow and temporally variable responses in comparison with neurons in primary auditory cortical or subcortical structures (14-16).

Neurophysiological studies in humans and nonhuman animals have provided considerable insights into the role of cortical oscillations during multisensory conditions and in parsing speech. Cortical oscillations entrain to the slow temporal dynamics of natural sounds (17-20) and are thought to reflect the excitability of local networks to sensory inputs...
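As a purely illustrative aside (not the analysis used in this study), phase resetting of ongoing oscillations by a cross-sensory onset is commonly quantified as inter-trial phase coherence (ITC): the phase of band-pass-filtered activity is extracted on each trial, and the consistency of that phase across trials is measured around the stimulus onset. The minimal sketch below uses simulated data and assumed parameters (a 4-8 Hz band, 1 kHz sampling, an onset at 1 s) to show the computation.

```python
# Minimal sketch (simulated data; not the authors' pipeline): quantify phase
# resetting at a visual "face onset" as inter-trial phase coherence (ITC).
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 1000                        # sampling rate in Hz (assumed)
n_trials, n_samples = 60, 2000   # 2 s epochs; face onset at sample 1000
onset = 1000
rng = np.random.default_rng(0)
t = np.arange(n_samples) / fs

# Simulate trials: an ongoing ~6 Hz oscillation with random phase per trial,
# plus a component whose phase is aligned to the onset (the "reset"), plus noise.
trials = np.empty((n_trials, n_samples))
for k in range(n_trials):
    ongoing = np.sin(2 * np.pi * 6 * t + rng.uniform(0, 2 * np.pi))
    reset = np.zeros(n_samples)
    reset[onset:] = np.sin(2 * np.pi * 6 * (t[onset:] - t[onset]))
    trials[k] = 0.5 * ongoing + reset + 0.5 * rng.standard_normal(n_samples)

# Band-pass in a low-frequency (theta) band and extract instantaneous phase.
b, a = butter(3, [4 / (fs / 2), 8 / (fs / 2)], btype="band")
phase = np.angle(hilbert(filtfilt(b, a, trials, axis=1), axis=1))

# ITC = length of the mean resultant phase vector across trials
# (0 = random phase across trials, 1 = identical phase on every trial).
# A rise in ITC after the onset indicates phase resetting.
itc = np.abs(np.mean(np.exp(1j * phase), axis=0))
print("mean ITC pre-onset: %.2f, post-onset: %.2f"
      % (itc[200:800].mean(), itc[1200:1800].mean()))
```

In this toy example the pre-onset ITC stays near chance because each trial's ongoing oscillation starts at a random phase, whereas the post-onset ITC rises because the simulated reset component is phase-locked to the onset on every trial.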