Music exists in almost every society, has universal acoustic features, and is processed by distinct neural circuits in humans, even in those with no musical training. These characteristics suggest that the sense of music is innate in our brain, but it is unclear how this innateness emerges and what functions it serves. Here, using an artificial deep neural network that models the auditory information processing of the brain, we show that units tuned to music can emerge spontaneously from learning natural sound detection, even without learning music. By simulating the responses of network units to 35,487 natural sounds in 527 categories, we found that various subclasses of music are strongly clustered in the embedding space, and that this clustering arises from the music-selective response of the network units. The music-selective units encoded the temporal structure of music at multiple timescales, following the population-level response characteristics observed in the brain. We confirmed that the process of generalization is critical for the emergence of music-selectivity and that music-selectivity can work as a functional basis for the generalization of natural sound, thereby elucidating its origin. These findings suggest that our sense of music can be innate, universally shaped by evolutionary adaptation to process natural sound.

One-sentence summary: Music-selectivity can arise spontaneously in deep neural networks trained for natural sound detection without learning music.