Rhythm is essential in music. It is formed by the hierarchical structuring of events such as notes, beats, accents and grouping. The core acoustic correlates of these hierarchical rhythmic events, however, remain unknown. This study examined the slow (<40 Hz) temporal modulation structure of music by applying two modelling approaches previously applied to infant- and child-directed speech. The first, the Spectral-Amplitude Modulation Phase Hierarchy (S-AMPH) approach, utilized a low-dimensional representation of the auditory signal based on the cochlear filterbank in the human brain, comprising 5 broad spectral bands and 3 broad bands of amplitude modulation. The second utilized probabilistic amplitude demodulation (PAD), a signal-driven modelling approach that incorporates no assumptions about neural processing. PAD has produced successful models of natural sounds such as wind and rain, which are characterized by amplitude modulation 'cascades' correlated over long time scales and across multiple frequency bands. When applied to music, both models revealed a very similar hierarchically nested amplitude modulation spectrum across different musical genres and instruments, including song. Accordingly, the core temporal architecture underlying musical rhythm appears universal. The same architecture is revealed by modelling infant- and child-directed speech, suggesting that music and language may depend on a shared, domain-general amplitude modulation architecture that, early in life, relies on shared neural processes. The demonstration that temporal modulation bands play a key role in rhythm hierarchies in music as well as in speech suggests that the same evolutionary adaptations may underpin both music and language, explaining why both are human universals.
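To make the two-stage analysis concrete, the sketch below outlines how a hierarchically nested amplitude modulation spectrum of the kind described above could be extracted: the audio is first split into broad spectral bands, the amplitude envelope of each band is computed, and each envelope is then filtered into slow (<40 Hz) amplitude modulation bands. This is a minimal illustration assuming SciPy; the band edges, filter orders, and function names are placeholders and do not reproduce the S-AMPH or PAD implementations used in the study.

```python
# Minimal sketch of a spectral-band / amplitude-modulation-band decomposition.
# Band edges and filter settings are illustrative placeholders only; they do not
# reproduce the S-AMPH parameters or the PAD model reported in the study.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

ENV_FS = 1000  # sampling rate (Hz) for the downsampled envelopes (assumed choice)

def bandpass(x, lo, hi, fs, order=4):
    """Zero-phase Butterworth band-pass filter."""
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def modulation_hierarchy(audio, fs):
    """Return an array of shape (n_spectral_bands, n_AM_bands, n_samples)."""
    # Five illustrative spectral bands (Hz), loosely cochlear-spaced.
    spectral_bands = [(100, 300), (300, 700), (700, 1750), (1750, 3900), (3900, 7250)]
    # Three illustrative slow amplitude-modulation bands (Hz), all below 40 Hz.
    am_bands = [(0.9, 2.5), (2.5, 12.0), (12.0, 40.0)]

    envelopes = []
    for lo, hi in spectral_bands:
        band = bandpass(audio, lo, hi, fs)
        env = np.abs(hilbert(band))                # amplitude envelope of the band
        env = resample_poly(env, ENV_FS, int(fs))  # downsample before slow filtering
        envelopes.append(env)

    # Filter each spectral-band envelope into the slow AM bands; the phase
    # relations between these nested modulation bands carry the rhythm hierarchy.
    return np.array([[bandpass(env, lo, hi, ENV_FS, order=2) for lo, hi in am_bands]
                     for env in envelopes])
```

By contrast, a PAD-style analysis would not impose a fixed filterbank: it infers envelopes and carriers probabilistically from the signal itself, which is why it serves here as a brain-agnostic check on the S-AMPH result.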