A processing scheme for speech signals is proposed that emulates synchrony capture in the auditory nerve. Stimulus-locked spike timing plays an important role in representing stimulus periodicity, low-frequency spectrum, and spatial location. In synchrony capture, dominant single frequency components in each frequency region impress their time structures on the temporal firing patterns of auditory nerve fibers with nearby characteristic frequencies (CFs). At low frequencies, for voiced sounds, synchrony capture divides the nerve into discrete CF territories associated with individual harmonics. An adaptive synchrony capture filterbank (SCFB) is proposed, consisting of a fixed array of traditional, passive, linear (gammatone) filters cascaded with a bank of adaptively tunable bandpass filter triplets. Differences in triplet output envelopes steer the triplet center frequencies via voltage-controlled oscillators (VCOs). The SCFB exhibits some cochlea-like responses, such as two-tone suppression and distortion products, and possesses many desirable properties for processing speech, music, and natural sounds. Strong signal components dominate relatively greater numbers of filter channels, thereby yielding robust encodings of relative component intensities. The VCOs precisely lock onto the harmonics most important for formant tracking, pitch perception, and sound separation.
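The envelope-steering idea can be illustrated with a toy frequency-locked loop: two flanking resonators straddle the current center frequency, and the difference of their output envelopes drives the center frequency toward the dominant nearby component, as in the VCO control law described above. The Python sketch below is illustrative only, not the SCFB's actual filters; the one-pole complex resonators, offsets, and gains are assumptions chosen for the demo.

```python
import numpy as np

fs = 16000.0
f_tone = 700.0                       # strong component the triplet should capture
t = np.arange(int(0.5 * fs)) / fs
x = np.sin(2 * np.pi * f_tone * t)

fc = 500.0                           # initial triplet center frequency (Hz)
df = 60.0                            # offset of the flanking filters (Hz)
r = 0.99                             # resonator pole radius (sets bandwidth)
mu = 0.5                             # adaptation gain (Hz per unit envelope difference)
y_lo = y_hi = 0.0 + 0.0j             # complex one-pole resonator states
track = np.empty_like(x)

for i, s in enumerate(x):
    # Flanking complex resonators centered at fc - df and fc + df.
    y_lo = s + r * np.exp(2j * np.pi * (fc - df) / fs) * y_lo
    y_hi = s + r * np.exp(2j * np.pi * (fc + df) / fs) * y_hi
    # Envelope difference steers the center frequency (the "VCO" control law):
    # a stronger upper-filter envelope pulls fc upward, and vice versa.
    e_lo, e_hi = abs(y_lo), abs(y_hi)
    fc += mu * (e_hi - e_lo) / (e_hi + e_lo + 1e-9)
    track[i] = fc

print(track[-1])                     # settles near 700 Hz: the component is captured
```

Once the loop locks, the envelope difference vanishes and the center frequency sits on the captured component, which is the sense in which a strong component comes to dominate nearby channels.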
This work addresses the problem of identifying multiple fundamental frequencies in an acoustic signal. An auditory-inspired peripheral signal processing model is proposed that functions more like a bank of FM receivers than like a traditional filterbank. Such receivers lock onto a strong signal component (synchrony capture, frequency capture) even in the presence of nearby, only slightly weaker components. Once the individual signal components are resolved, the model subjects them to an instantaneous nonlinearity and then performs harmonic grouping by cross-correlating the isolated components. After the harmonically related components are grouped, their pitches are computed using a standard summary autocorrelation approach.
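The final stage named here, summary autocorrelation, is standard enough to sketch. In the fragment below, a few fixed bandpass channels stand in for the FM-receiver front end (an assumption; the abstract's model resolves components adaptively), each channel is half-wave rectified as the instantaneous nonlinearity, the normalized channel autocorrelations are summed, and the peak lag of the summary gives the pitch. The filter edges and search range are illustrative.

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000
t = np.arange(int(0.1 * fs)) / fs
# Harmonic complex with F0 = 200 Hz (fundamental plus four harmonics).
x = sum(np.sin(2 * np.pi * 200 * k * t) for k in range(1, 6))

# Crude fixed channel decomposition standing in for the adaptive front end.
edges = [(100, 400), (400, 800), (800, 1400), (1400, 2200)]
sacf = np.zeros(len(t))
for lo, hi in edges:
    b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    ch = np.maximum(lfilter(b, a, x), 0.0)       # instantaneous nonlinearity (HWR)
    ac = np.correlate(ch, ch, mode="full")[len(t) - 1:]
    sacf += ac / (ac[0] + 1e-12)                 # normalized channel autocorrelation

lo_lag, hi_lag = int(fs / 500), int(fs / 80)     # search pitches in 80-500 Hz
peak = lo_lag + np.argmax(sacf[lo_lag:hi_lag])
print(fs / peak)                                 # ~ 200 Hz
```

For multiple simultaneous fundamentals, the harmonic-grouping stage would first partition channels by cross-correlation so that each group contributes to its own summary autocorrelation.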
Signal representation in the cochlea is often thought to involve either rate-place profiles or purely temporal, interspike interval codes. Spatiotemporal coding strategies based on phase-locking, cochlear delays, and coincidence detectors have also been proposed [Loeb et al., Biol. Cybern. (1983); K. & Shamma, J. Acoust. Soc. Am. 107 (2000); Carney et al., Acta Acustica 88, 334–337 (2002)]. In this view, spatiotemporal patterns of spikes locked to the relative phases of the traveling wave at specific cochlear places at a given time can convey information about a tone. We propose a general mathematical basis for using such spatial phase/amplitude patterns along the frequency axis to represent an arbitrary, (approximately) time- and band-limited signal. We posit that the spatial pattern of phases and amplitudes corresponds to the locations at which the real and/or imaginary parts of the Fourier transform of the signal cross certain levels (e.g., the zero level). Given these locations, we show that the original signal can be accurately reconstructed by solving a simple eigenvalue problem. Using this approach, we propose an analysis/synthesis algorithm to represent speech-like signals. We conjecture that a generalized representation of the forms of signals can be inferred from spatial, cross-CF patterns of phase relations present in the auditory nerve. [Work supported by AFOSR FA9550-09-1-0119.]
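A discrete toy analogue makes the reconstruction step concrete; this is a sketch under simplifying assumptions, not the authors' analysis/synthesis algorithm. For a short real signal, each frequency at which the real or imaginary part of its discrete-time Fourier transform crosses a chosen level yields one linear constraint on the samples; written homogeneously, the signal (with a trailing 1) spans the null space of the constraint matrix, so it can be recovered via a singular value decomposition, i.e., an eigenvalue problem on the matrix's Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
x = rng.standard_normal(N)                 # "unknown" signal to be recovered

n = np.arange(N)
w = np.linspace(0.0, np.pi, 8192)[1:-1]    # dense frequency grid on (0, pi)
ReX = np.cos(np.outer(w, n)) @ x           # real part of the DTFT of x
ImX = -np.sin(np.outer(w, n)) @ x          # imaginary part

def level_crossings(f, c):
    """Linearly interpolated frequencies where samples f cross level c."""
    g = f - c
    k = np.flatnonzero(np.sign(g[:-1]) * np.sign(g[1:]) < 0)
    return w[k] - g[k] * (w[k + 1] - w[k]) / (g[k + 1] - g[k])

# Each crossing "Re X(w0) = c" gives one linear constraint on the samples:
#   sum_n x[n] cos(w0 n) - c = 0, written homogeneously as a row [cos(w0 n), -c];
# crossings of the imaginary part give rows [-sin(w0 n), -c].
rows, rhs = [], []
for c in (0.0, 0.5):                       # zero level plus one nonzero level
    for part, basis in ((ReX, np.cos), (ImX, lambda z: -np.sin(z))):
        for w0 in level_crossings(part, c):
            rows.append(basis(w0 * n))
            rhs.append(-c)
A = np.column_stack([np.array(rows), np.array(rhs)])

# The homogeneous vector [x, 1] spans the null space of A: recover it as the
# right singular vector of the smallest singular value (an eigenvector of A.T @ A).
v = np.linalg.svd(A)[2][-1]
x_hat = v[:-1] / v[-1]
print(np.max(np.abs(x - x_hat)))           # ~ 0, up to grid/interpolation error
```

For a generic random signal the four crossing families supply comfortably more than the N constraints needed; the continuous-time, speech-like case in the abstract replaces these rows with the corresponding Fourier-domain constraints.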