Music and speech encode hierarchically organized structural complexity in the service of human expressiveness and communication. Carefully controlled experiments have suggested that populations of neurons in cortical auditory regions track temporal modulations within rhythmic acoustic signals, physiologically supporting perception of both music and speech. However, whether cortical tracking of music and speech extends to less controlled (i.e., naturalistic) signals remains contentious. Here, we investigated whether cortical tracking can be observed under more natural perceptual scenarios, and how stimulus type, frequency band, and anatomical localization modulate this effect. We analyzed intracranial recordings from 30 subjects while they passively watched a movie in which visual scenes were accompanied by either music or speech stimuli. Cross-correlation between brain and acoustic signals, along with density-based clustering analyses and linear mixed-effects modeling, revealed both anatomically overlapping and functionally distinct mapping of the tracking effect as a function of stimulus type and frequency band. We observed widespread tracking of music and speech signals in the Slow Frequency Band (1–8 Hz), with near-zero temporal lags and high mixed selectivity. In contrast, High Frequency Band (70–120 Hz) tracking was higher during speech perception, was more densely concentrated in classical language processing areas, and showed a clear frontal-to-temporal gradient in lag values that was not observed during perception of musical stimuli. Our results highlight the recruitment of domain-general and domain-specific mechanisms during perception of naturalistic music and speech signals, as well as a complex interaction between cortical region and frequency band that shapes temporal dynamics during processing of hierarchically organized temporal structures in speech.
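To make the notion of a tracking lag concrete, the following minimal sketch computes a lagged Pearson correlation between a neural trace and an acoustic envelope and reads off the lag at which the correlation peaks. It is an illustration under stated assumptions, not the analysis pipeline used in this study: the function name `lagged_xcorr`, the sampling rate, the lag window, and the synthetic signals are all hypothetical, and band-limiting the signals (e.g., to 1–8 Hz or 70–120 Hz) is assumed to have been done beforehand.

```python
import numpy as np


def lagged_xcorr(neural, acoustic, fs, max_lag_s):
    """Pearson correlation between two equal-length signals at each lag.

    Hypothetical helper for illustration only. A positive lag means the
    neural signal follows the acoustic signal by that many seconds.
    """
    max_lag = int(round(max_lag_s * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.empty(lags.size)
    n = len(neural)
    for i, lag in enumerate(lags):
        if lag >= 0:
            a = acoustic[: n - lag]   # earlier acoustic samples
            b = neural[lag:]          # later neural samples
        else:
            a = acoustic[-lag:]
            b = neural[: n + lag]
        r[i] = np.corrcoef(a, b)[0, 1]
    return lags / fs, r


# Synthetic example (assumed values): the "neural" trace is a noisy copy of
# the envelope, delayed by 5 samples (50 ms at 100 Hz).
rng = np.random.default_rng(0)
fs = 100.0
envelope = rng.standard_normal(2000)
neural = np.roll(envelope, 5) + 0.5 * rng.standard_normal(2000)

lags_s, r = lagged_xcorr(neural, envelope, fs, max_lag_s=0.3)
peak_lag_s = lags_s[np.argmax(r)]  # lag (s) with the strongest correlation
```

In this toy case the peak lag recovers the imposed 50 ms delay. With real recordings, the peak correlation would additionally need to be tested against a null distribution (for example, from surrogate or time-shifted data) before being interpreted as tracking.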