In face-to-face communication, audio-visual (AV) stimuli can be fused, combined, or perceived as mismatching. While the left superior temporal sulcus (LSTS) is generally acknowledged as the locus of AV integration, the process leading to combination is unknown. Analysing behaviour and time- and source-resolved human MEG data, we show that fusion and combination both involve early detection of a discrepancy between AV physical features in the LSTS, but that this initial registration is followed, in combination only, by the activation of AV asynchrony-sensitive regions (auditory and inferior frontal cortices). Based on dynamic causal modelling and neural signal decoding, we further show that the outcome of AV speech integration primarily depends on whether or not the LSTS quickly converges onto an existing multimodal syllable representation, and that combination results from the subsequent temporal re-ordering of the discrepant AV stimuli in time-sensitive regions of the prefrontal and temporal cortices.

Keywords: Audio-visual integration, Combination, McGurk effect, Neural dynamics, Audio-visual asynchrony.

In the McGurk paradigm, an auditory /aba/ dubbed onto a visual /aga/ is typically fused into the illusory percept /ada/, whereas an auditory /aga/ dubbed onto a visual /aba/ tends to be combined into /abga/ or /agba/. What determines whether AV stimuli are going to be fused 2-4 or combined 5, and the neural dynamics underlying this perceptual divergence, is not yet known.

Audio-visual speech integration draws on a number of processing steps distributed over several cortical regions, including the auditory and visual cortices, the left posterior temporal cortex, and higher-level language regions of the left prefrontal 6,7 and anterior temporal 8,9 cortices. In this cortical hierarchy, the left superior temporal sulcus (LSTS) plays a central role in integrating visual and auditory inputs from the visual motion area (mediotemporal cortex, MT) and the auditory cortex (AC) 10-15. The LSTS is characterized by relatively smooth temporal integration properties that enable it to cope with the natural asynchrony between auditory and visual speech inputs, i.e. the fact that orofacial speech movements often start before the sounds they produce 4,16,17. Although the LSTS responds more strongly when auditory and visual speech are perfectly synchronous 18,19, its activity can accommodate large temporal discrepancies, reflecting a broad temporal window of integration on the order of the syllable length (up to ~260 ms) 20. This window of integration can even be pathologically stretched to about 1 s in subjects suffering from autism spectrum disorder 21. Yet, the detection of shorter temporal AV asynchronies is possible and takes place in other brain regions, in particular the dorsal premotor area and the inferior frontal gyrus 22-25.

Computational modelling has shown that considering the temporal patterns of the stimuli in a two-dimensional (2D) feature space defined by the 2nd acoustic formant and lip aperture is sufficient to qualitatively reproduce participants' behaviour for fused 13,29 but also combined responses 28. Simulations indicated that fusion is possible, and even expected, when the physical features of the A and V stimuli, represented in the model by the 2nd formant and lip aperture, are located in the neighbourhood...
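As a concrete illustration of this modelling idea, the sketch below casts the 2D feature space as a nearest-prototype decision: an AV pair is "fused" when its joint (2nd formant, lip aperture) point falls close to an existing multimodal syllable representation, and "combined" otherwise. The prototype coordinates, the tolerance radius, and the decision rule are illustrative assumptions, not the published model.

```python
# Minimal sketch (assumptions): the prototype coordinates, the tolerance
# radius and the nearest-prototype rule are invented for illustration; this
# is not the published model. It only mirrors the idea that fusion occurs
# when the joint (2nd formant, lip aperture) features of an AV pair land
# near an existing multimodal syllable representation, combination otherwise.
import math

# Hypothetical multimodal prototypes in a normalized 2D feature space:
# (2nd formant, lip aperture).
PROTOTYPES = {
    "aba": (0.2, 0.9),  # bilabial: low F2 transition, strong lip closure
    "ada": (0.5, 0.5),  # alveolar: intermediate on both dimensions
    "aga": (0.8, 0.2),  # velar: high F2 transition, small lip movement
}
RADIUS = 0.45  # assumed tolerance around a prototype

def integrate(audio_f2: float, visual_lip: float) -> str:
    """Predict the percept for one AV syllable pair from its joint features."""
    av_point = (audio_f2, visual_lip)
    syllable, dist = min(
        ((name, math.dist(av_point, proto)) for name, proto in PROTOTYPES.items()),
        key=lambda item: item[1],
    )
    if dist <= RADIUS:
        # The joint AV point converges onto one multimodal representation.
        return f"fusion -> /{syllable}/"
    # No single representation is close enough: the A and V percepts
    # survive separately and must be re-ordered in time (combination).
    return "combination -> e.g. /abga/ or /agba/"

# Auditory /aba/ (low F2) dubbed onto visual /aga/ (small lip aperture):
# the joint point (0.2, 0.2) lands nearest the /ada/ prototype.
print(integrate(audio_f2=0.2, visual_lip=0.2))  # fusion -> /ada/
# Auditory /aga/ dubbed onto visual /aba/: the joint point (0.8, 0.9)
# is far from every prototype.
print(integrate(audio_f2=0.8, visual_lip=0.9))  # combination -> ...
```

In the paper's terms, the first case corresponds to the LSTS quickly converging onto an existing multimodal syllable representation, while the second leaves the discrepant A and V percepts to be temporally re-ordered by asynchrony-sensitive regions.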