Electrophysiological studies show that top-down modulation enhances neural tracking of attended speech in environments with overlapping speech. Yet the specific cortical regions involved remain unclear because of the limited spatial resolution of most electrophysiological techniques. Therefore, we performed speech envelope reconstruction and representational dissimilarity-based EEG-fMRI fusion (using temporal response functions estimated from EEG, n = 19, and fMRI, n = 19) to determine the spatiotemporal dynamics of attention to audiovisual cocktail-party speech. Attention-related enhancement of neural tracking fluctuated in predictable temporal profiles. Such temporal dynamics may arise from interactions between attention and prediction, from other plastic mechanisms in the auditory cortex, or from both. EEG-fMRI fusion revealed attention-related recurrent feedforward-feedback loops in the ventral processing stream. Our findings support models in which attention facilitates dynamic neural changes in the auditory cortex, ultimately aiding discrimination of relevant sounds from irrelevant ones using minimal neural resources.