A computational model of auditory analysis is described that is inspired by psychoacoustical and neurophysiological findings in early and central stages of the auditory system. The model provides a unified multiresolution representation of the spectral and temporal features likely critical in the perception of sound. Simplified, more specifically tailored versions of this model have already been validated by successful application in the assessment of speech intelligibility [Elhilali et al., Speech Commun. 41(2-3), 331-348 (2003); Chi et al., J. Acoust. Soc. Am. 106, 2719-2732 (1999)] and in explaining the perception of monaural phase sensitivity [R. Carlyon and S. Shamma, J. Acoust. Soc. Am. 114, 333-348 (2003)]. Here we provide a more complete mathematical formulation of the model, illustrating how complex signals are transformed through various stages of the model, and relating it to comparable existing models of auditory processing. Furthermore, we outline several reconstruction algorithms to resynthesize the sound from the model output so as to evaluate the fidelity of the representation and contribution of different features and cues to the sound percept.
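The multiresolution (rate-scale) analysis referred to above can be pictured as a bank of two-dimensional modulation-selective filters applied to a time versus log-frequency auditory spectrogram. The Python sketch below is a minimal illustration under our own assumptions: the Gabor-shaped kernels, parameter names, and the premise that an auditory spectrogram has already been computed are illustrative choices, not the model's actual seed functions.

```python
import numpy as np

def gabor_2d(rate_hz, scale_cpo, t, x):
    # Illustrative 2D kernel: Gaussian envelope (width shrinking with rate and
    # scale) modulated by a carrier drifting in time and log-frequency.
    T, X = np.meshgrid(t, x, indexing="ij")
    env = np.exp(-0.5 * (T * rate_hz) ** 2 - 0.5 * (X * scale_cpo) ** 2)
    return env * np.cos(2 * np.pi * (rate_hz * T + scale_cpo * X))

def rate_scale_analysis(spectrogram, frame_rate_hz, n_octaves, rates, scales):
    # Filter a (time x log-frequency) spectrogram with each rate-scale kernel
    # via FFT-based (circular) convolution.
    n_t, n_x = spectrogram.shape
    t = (np.arange(n_t) - n_t // 2) / frame_rate_hz        # seconds, centered
    x = (np.arange(n_x) - n_x // 2) * n_octaves / n_x      # octaves, centered
    S = np.fft.fft2(spectrogram)
    out = np.empty((len(rates), len(scales), n_t, n_x))
    for i, r in enumerate(rates):
        for j, s in enumerate(scales):
            kernel = np.fft.ifftshift(gabor_2d(r, s, t, x))
            out[i, j] = np.real(np.fft.ifft2(S * np.fft.fft2(kernel)))
    return out

# Toy example: 1 s spectrogram at 100 frames/s, 128 channels over 5 octaves.
spec = np.random.rand(100, 128)
cortical = rate_scale_analysis(spec, 100.0, 5, rates=[2, 4, 8, 16], scales=[0.5, 1, 2, 4])
print(cortical.shape)   # (4, 4, 100, 128): rate x scale x time x frequency
```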
Detection thresholds for spectral and temporal modulations are measured using broadband spectra with sinusoidally rippled profiles that drift up or down the log-frequency axis at constant velocities. Spectro-temporal modulation transfer functions (MTFs) are derived as a function of ripple peak density (Ω cycles/octave) and drifting velocity (ω Hz). The MTFs exhibit a low-pass function with respect to both dimensions, with 50% bandwidths of about 16 Hz and 2 cycles/octave. The data replicate (as special cases) previously measured purely temporal MTFs (Ω=0) [Viemeister, J. Acoust. Soc. Am. 66, 1364–1380 (1979)] and purely spectral MTFs (ω=0) [Green, in Auditory Frequency Selectivity (Plenum, Cambridge, 1986), pp. 351–359]. A computational auditory model is presented that exhibits spectro-temporal MTFs consistent with the salient trends in the data. The model is used to demonstrate the potential relevance of these MTFs to the assessment of speech intelligibility in noise and reverberant conditions.
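For concreteness, a drifting ripple of the kind described above can be generated as a bank of log-spaced tones whose amplitudes follow a sinusoidal spectral profile S(t, x) = 1 + ΔA·sin(2π(ωt + Ωx) + φ), where x is the tone position in octaves. The Python sketch below is only an illustration: the tone count, randomized carrier phases, linear-amplitude profile, and default parameter values are simplifying assumptions and differ in detail from the stimuli used in the experiments.

```python
import numpy as np

def moving_ripple(omega_hz, Omega_cpo, dur=1.0, sr=16000,
                  f0=250.0, n_octaves=5, n_tones=100, depth=0.9, phi=0.0):
    # Sum of log-spaced tones whose amplitudes follow a sinusoidal profile
    # drifting along the log-frequency axis at omega_hz with density Omega_cpo.
    t = np.arange(int(dur * sr)) / sr
    x = np.linspace(0, n_octaves, n_tones)      # tone positions in octaves above f0
    freqs = f0 * 2.0 ** x
    rng = np.random.default_rng(0)
    sig = np.zeros_like(t)
    for xi, fi in zip(x, freqs):
        env = 1.0 + depth * np.sin(2 * np.pi * (omega_hz * t + Omega_cpo * xi) + phi)
        sig += env * np.sin(2 * np.pi * fi * t + rng.uniform(0, 2 * np.pi))
    return sig / np.max(np.abs(sig))

# e.g., a ripple of 2 cycles/octave drifting at 16 Hz
y = moving_ripple(omega_hz=16.0, Omega_cpo=2.0)
```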
Speech intelligibility is known to be relatively unaffected by certain deformations of the acoustic spectrum. These include translations, stretching or contracting dilations, and shearing of the spectrum (represented along the logarithmic frequency axis). It is argued here that such robustness reflects a synergy between vocal production and auditory perception. Thus, on the one hand, it is shown that these spectral distortions are produced by common and unavoidable variations among different speakers pertaining to the length, cross-sectional profile, and losses of their vocal tracts. On the other hand, it is argued that these spectral changes leave the auditory cortical representation of the spectrum largely unchanged except for translations along one of its representational axes. These assertions are supported by analyses of production and perception models. On the production side, a simplified sinusoidal model of the vocal tract is developed which analytically relates a few "articulatory" parameters, such as the extent and location of the vocal tract constriction, to the spectral peaks of the acoustic spectra synthesized from it. The model is evaluated by comparing the identification of synthesized sustained vowels to labeled natural vowels extracted from the TIMIT corpus. On the perception side a "multiscale" model of sound processing is utilized to elucidate the effects of the deformations on the representation of the acoustic spectrum in the primary auditory cortex. Finally, the implications of these results for the perception of generally identifiable classes of sound sources beyond the specific case of speech and the vocal tract are discussed.
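As a concrete instance of the first deformation mentioned above, a change in vocal-tract length scales all formant frequencies by a common factor, which becomes a pure translation once the spectrum is indexed by log-frequency. The short Python check below uses hypothetical formant values and a hypothetical scaling factor purely for illustration.

```python
import numpy as np

formants = np.array([500.0, 1500.0, 2500.0])   # illustrative formant frequencies (Hz)
alpha = 1.2                                     # hypothetical vocal-tract scaling factor

scaled = alpha * formants
shift = np.log2(scaled) - np.log2(formants)     # displacement along the log2-frequency axis

# Every formant moves by the same amount, log2(alpha) ~ 0.263 octaves:
# a uniform frequency scaling is a translation on the logarithmic axis.
print(shift, np.log2(alpha))
```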
The sound received at the ears is processed by humans using signal processing that separates the signal along intensity, pitch, and timbre dimensions. Conventional Fourier-based signal processing, while endowed with fast algorithms, cannot easily represent the signal along these attributes. In this paper we use a recently proposed cortical representation to represent and manipulate sound. We briefly review algorithms for obtaining, manipulating, and inverting the cortical representation of a sound, and describe algorithms for manipulating signal pitch and timbre separately. The algorithms are first used to create the sound of an instrument intermediate between a "guitar" and a "trumpet". Applications to creating maximally separable sounds in auditory user interfaces are discussed.

Partial support of ONR grant N000140110571 is gratefully acknowledged.

1. INTRODUCTION

When a natural source such as a human voice or a musical instrument produces a sound, the resulting acoustic wave is generated by a time-varying excitation pattern of a possibly time-varying channel, and the sound characteristics depend both on the excitation signal and on the production system. The production system (e.g., the human vocal tract, the guitar box, or the flute tube) has its own characteristic response; variation of the excitation parameters produces a sound signal that has different frequency components, but still retains perceptual characteristics of the uniqueness of the production instrument (the identity of the person, the type of instrument: piano, violin, etc.). When one is asked to characterize this sound source using descriptions based on Fourier analysis, one discovers that concepts such as frequency and amplitude are insufficient to explain the characteristics of the sound source. Human linguistic descriptions characterize the sound in terms of pitch and timbre.

The perceived pitch of a sound is closely coupled with its harmonic structure. The timbre of a sound, on the other hand, is defined broadly as everything other than the pitch, loudness, and spatial location of the sound. For example, two musical instruments might have the same pitch if they play the same note, but it is their different timbre that allows us to distinguish between them. Specifically, the spectral envelope in frequency and its variations over time are related to the timbre percept. Most conventional techniques of sound manipulation change the pitch and timbre simultaneously and cannot be used to assess the effects of the pitch and timbre dimensions independently. A goal of this paper is the development of controls for independent manipulation of the pitch and timbre of a sound source, using a cortical sound representation that was introduced in [1] and used for the assessment of speech intelligibility and the prediction of the cortical response to an arbitrary stimulus. We simulate the multiscale audio representation and processing believed to occur in the primate brain (supported by recent psychophysiological papers [2]), and while our sound decomposition is p...
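To make the pitch/timbre split concrete, the sketch below separates a short-time log-spectrum into a smooth envelope (timbre-like) and a fine-structure residual (pitch-like) using a simple cepstral lifter. This is only a stand-in for illustration: the paper performs the separation within the cortical (multiscale) representation itself, and the function name and lifter length here are our own choices.

```python
import numpy as np

def split_envelope_excitation(frame, n_lifter=30):
    # Split one windowed frame into a smooth spectral envelope (timbre-like)
    # and a fine-structure residual (pitch-like) via cepstral liftering.
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spec) + 1e-12)
    cep = np.fft.irfft(log_mag)
    lifter = np.zeros_like(cep)
    lifter[:n_lifter] = 1.0
    lifter[-n_lifter + 1:] = 1.0               # keep the low-quefrency (smooth) part
    envelope = np.fft.rfft(lifter * cep).real  # smooth log-spectral envelope
    excitation = log_mag - envelope            # residual carries the harmonic structure
    return envelope, excitation

# Recombining a frequency-stretched excitation with the original envelope would
# shift the perceived pitch while (approximately) preserving the timbre, and
# swapping envelopes between sources would do the converse.
```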