“…A very productive line of research put the emphasis on the temporal aspects of the speech structure and explored speech perception in terms of temporal-modulation processing (e.g., Houtgast and Steeneken, 1973; Plomp, 1983; Rosen, 1992; Drullman, 1995; Shannon et al, 1995; Zeng et al, 2005; Moore, 2008; Shamma and Lorenzi, 2013). Altogether, these studies demonstrated that (i) speech sounds convey salient modulations in amplitude (AM) and frequency (FM) resulting from the dynamic modulation of the vocal-tract geometric characteristics and vocal-fold vibrations (e.g., Varnet et al, 2017); (ii) the human auditory system is exquisitely sensitive to these modulation cues and certainly optimized to detect and discriminate modulation cues at the output of perceptual filters selectively tuned in the AM domain (Rodriguez et al, 2010; Koumura et al, 2019) and, in the case of slow FM carried by low-frequency sounds, due to temporal coding mechanisms using neural phase-locking to the temporal fine structure of narrowband signals at the output of cochlear filters (Paraouty et al, 2018); and (iii) the ability to identify speech in a variety of listening conditions is constrained by the ability to perceive accurately these relatively slow AM and FM components (e.g., Fu, 2002; Johannesen et al, 2016; Parthasarathy et al, 2020).…”
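
To make the AM and FM components mentioned above concrete, here is a minimal sketch that splits a narrowband signal into a temporal envelope (AM) and an instantaneous frequency (FM) using the analytic signal. This is one common signal-processing convention, not necessarily the method used in the cited studies. The function names (`analytic_signal`, `am_fm_decompose`) and all parameter values (500 Hz carrier, 4 Hz AM, 2 Hz FM) are assumptions chosen for illustration only.

```python
import numpy as np


def analytic_signal(x):
    """Analytic signal of a real 1-D array, built by zeroing negative frequencies."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep the DC bin as-is
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep the Nyquist bin as-is
        weights[1:n // 2] = 2.0           # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)


def am_fm_decompose(x, fs):
    """Return (envelope, instantaneous frequency in Hz) of a narrowband signal."""
    z = analytic_signal(x)
    envelope = np.abs(z)                              # AM: temporal envelope
    phase = np.unwrap(np.angle(z))                    # continuous phase in radians
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)   # FM: one sample shorter than x
    return envelope, inst_freq


# Synthetic test tone (hypothetical values): 500 Hz carrier, slow AM and FM.
fs = 16000
t = np.arange(fs) / fs                    # 1 second of samples
carrier, am_rate, fm_rate, fm_depth = 500.0, 4.0, 2.0, 20.0
am = 1.0 + 0.5 * np.sin(2 * np.pi * am_rate * t)
true_freq = carrier + fm_depth * np.sin(2 * np.pi * fm_rate * t)
# Phase is the integral of 2*pi*true_freq over time.
phase = 2 * np.pi * carrier * t - (fm_depth / fm_rate) * np.cos(2 * np.pi * fm_rate * t)
x = am * np.cos(phase)

envelope, inst_freq = am_fm_decompose(x, fs)
```

For a broadband signal such as speech, this kind of decomposition is usually applied per frequency band, after a bank of bandpass filters standing in for the cochlear filters; that filtering stage is left out of the sketch.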