Computer studies on parametric coding of speech spectra

Flanagan, James L.; Christensen, S. W.

doi:10.1121/1.384753

Cited by 11 publications

(3 citation statements)

References 0 publications

“…of a fixed filter bank. The amplitude, frequency, and phase measurements of the filter outputs are then used in various configurations of speech synthesizers [7]. Although the present work is based on the discrete Fourier transform (DFT), which can be interpreted as a filter bank, the use of a high-resolution DFT in combination with peak picking renders a highly adaptive filter…”

Section: Discussion (mentioning)

confidence: 99%

Speech analysis/Synthesis based on a sinusoidal representation

McAulay

Quatieri

1986

IEEE Trans. Acoust., Speech, Signal Process.

1,183

593

A sinusoidal model for the speech waveform is used to develop a new analysis/synthesis technique that is characterized by the amplitudes, frequencies, and phases of the component sine waves. These parameters are estimated from the short-time Fourier transform using a simple peak-picking algorithm. Rapid changes in the highly resolved spectral components are tracked using the concept of "birth" and "death" of the underlying sine waves. For a given frequency track a cubic function is used to unwrap and interpolate the phase such that the phase track is maximally smooth. This phase function is applied to a sine-wave generator, which is amplitude modulated and added to the other sine waves to give the final speech output. The resulting synthetic waveform preserves the general waveform shape and is essentially perceptually indistinguishable from the original speech. Furthermore, in the presence of noise the perceptual characteristics of the speech as well as the noise are maintained. In addition, it was found that the representation was sufficiently general that high-quality reproduction was obtained for a larger class of inputs including: two overlapping, superposed speech waveforms; music waveforms; speech in musical backgrounds; and certain marine biologic sounds. Finally, the analysis/synthesis system forms the basis for new approaches to the problems of speech transformations including time-scale and pitch-scale modification, and mid-rate speech coding [8], [9].
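The peak-picking step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Hann window, the `threshold` parameter, and the function names `dft` and `pick_peaks` are assumptions made for illustration, and the track "birth"/"death" logic and cubic phase interpolation are left out.

```python
import cmath
import math

def dft(frame):
    """Naive O(N^2) DFT of a real frame; returns bins 0..N/2 only."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]

def pick_peaks(frame, sample_rate, threshold=0.01):
    """Return (frequency_hz, amplitude, phase) for each local maximum of |DFT|.

    Assumption: a periodic Hann window is applied before the transform, and
    peaks whose estimated amplitude is below `threshold` are discarded.
    """
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]
    spectrum = dft([x * w for x, w in zip(frame, window)])
    mags = [abs(c) for c in spectrum]
    scale = 2.0 / sum(window)  # undo the window gain so amplitudes are in signal units
    peaks = []
    for k in range(1, len(mags) - 1):
        is_local_max = mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
        if is_local_max and mags[k] * scale > threshold:
            peaks.append((k * sample_rate / n, mags[k] * scale, cmath.phase(spectrum[k])))
    return peaks
```

For a frame containing a few steady sinusoids, each returned triple approximates the frequency, amplitude, and phase of one component; frequencies that fall between bins would need interpolation or a longer transform, which this sketch does not attempt.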

“…In both cases, however, one has to decide on a time-frequency scale at which to calculate the amplitude envelopes. In psychophysical experiments, the scale is often chosen by using filters with a one-fourth octave bandwidth because this value matches the measured critical band of audition in humans (Flanagan and Christensen, 1980; Drullman, 1995). Shannon and colleagues have also shown that speech comprehension increases rapidly as the number of frequency bands is increased from one very wide band to a small number of still relatively wide bands, emphasizing the relative importance of temporal structure over spectral structure in speech comprehension (Shannon et al., 1995).…”

Section: Time-frequency Tuning of HVc Neurons and Speech Psychophysics (mentioning)

confidence: 99%
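The one-fourth octave bandwidth mentioned in the citation statement above implies band edges that grow geometrically with frequency. A minimal sketch of such a band layout follows; the function name `fractional_octave_bands` and the choice of contiguous, non-overlapping bands are assumptions for illustration, not details taken from the cited papers.

```python
def fractional_octave_bands(f_min, f_max, fraction=4):
    """Return (low, center, high) edges of contiguous 1/fraction-octave bands.

    Each band spans a frequency ratio of 2**(1/fraction), so with fraction=4
    the upper edge is about 1.19 times the lower edge.
    """
    step = 2.0 ** (1.0 / fraction)        # ratio between adjacent band centers
    half = 2.0 ** (1.0 / (2 * fraction))  # ratio between a center and its edges
    bands = []
    center = f_min * half
    # Small relative tolerance so the last band is kept despite rounding error.
    while center * half <= f_max * (1 + 1e-12):
        bands.append((center / half, center, center * half))
        center *= step
    return bands
```

Calling `fractional_octave_bands(500.0, 4000.0)` covers three octaves with twelve quarter-octave bands; in absolute terms the bands get wider toward high frequencies, which is the time-frequency trade-off the statement refers to.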

Temporal and Spectral Sensitivity of Complex Auditory Neurons in the Nucleus HVc of Male Zebra Finches

1998

Complex vocalizations, such as human speech and birdsong, are characterized by their elaborate spectral and temporal structure. Because auditory neurons of the zebra finch forebrain nucleus HVc respond extremely selectively to a particular complex sound, the bird's own song (BOS), we analyzed the spectral and temporal requirements of these neurons by measuring their responses to systematically degraded versions of the BOS. These synthetic songs were based exclusively on the set of amplitude envelopes obtained from a decomposition of the original sound into frequency bands and preserved the acoustical structure present in the original song with varying degrees of spectral versus temporal resolution, which depended on the width of the frequency bands. Although both excessive temporal or spectral degradation eliminated responses, HVc neurons responded well to degraded synthetic songs with time-frequency resolutions of ~5 msec or 200 Hz. By comparing this neuronal time-frequency tuning with the time-frequency scales that best represented the acoustical structure in zebra finch song, we concluded that HVc neurons are more sensitive to temporal than to spectral cues. Furthermore, neuronal responses to synthetic songs were indistinguishable from those to the original BOS only when the amplitude envelopes of these songs were represented with 98% accuracy. That level of precision was equivalent to preserving the relative time-varying phase across frequency bands with resolutions finer than 2 msec. Spectral and temporal information are well known to be extracted by the peripheral auditory system, but this study demonstrates how precisely these cues must be preserved for the full response of high-level auditory neurons sensitive to learned vocalizations.
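The idea of reducing a sound to amplitude envelopes of frequency bands can be sketched for a single band as follows. This is an illustration under stated assumptions, not the method of the study: the band is selected with an ideal (brick-wall) mask on a naive DFT, the envelope is taken as the magnitude of the resulting analytic signal, and the name `band_envelope` is hypothetical.

```python
import cmath
import math

def band_envelope(signal, sample_rate, f_low, f_high):
    """Amplitude envelope of the part of `signal` lying in [f_low, f_high) Hz.

    Assumption: ideal band selection in the DFT domain. Only positive-frequency
    bins are kept and doubled, so the inverse transform is the analytic signal
    of the band and its magnitude is the envelope.
    """
    n = len(signal)
    # Positive-frequency bins (DC and Nyquist excluded) inside the band.
    bins = [k for k in range(1, (n + 1) // 2) if f_low <= k * sample_rate / n < f_high]
    # Naive DFT evaluated only at the selected bins.
    coeffs = {
        k: sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in bins
    }
    envelope = []
    for t in range(n):
        z = sum(2 * c * cmath.exp(2j * math.pi * k * t / n) for k, c in coeffs.items()) / n
        envelope.append(abs(z))
    return envelope
```

Narrow bands give fine spectral but coarse temporal resolution in the envelope, and wide bands the reverse, which is the trade-off the abstract describes when it varies the width of the frequency bands.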

“…Presumably, some of these difficulties were related to the method used to produce the auditory filtered spectra (see Method section). Further research will employ more direct methods of deriving the auditory filtered display, such as those proposed by Klatt (1976, 1979) and Flanagan and Christensen (1980). For example, direct filtering would result in a shortening of the analysis time window for high-frequency energy.…”

Section: Discussion (mentioning)

confidence: 99%

Time-varying features of initial stop consonants in auditory running spectra: A first report

Kewley‐Port

Luce

1984

Perception & Psychophysics

recently demonstrated that place of articulation of initial voiced stops could be identified from time-varying features observed in visual displays of linear prediction smoothed spectra. The present study extends this method of analysis in several directions. First, both voiced and voiceless syllable-initial stops produced at three speaking rates (normal, fast, and slow) were examined. Second, a new rule for vocal tract size normalization was tested. Third, the earlier time-varying features were augmented in order to specify the burst and voicing as well as place of articulation. The four time-varying features were (1) an abrupt increase in energy at high frequencies, (2) the onset of a prominent low-frequency peak, (3) the relative tilt of voiceless energy at onset, and (4) the presence of extended mid-frequency peaks. Finally, the visual displays were modified to incorporate filtering and other characteristics of processing of speech by the auditory system. Auditory running spectra were generated for stop consonant-vowel syllables read by two males and two females. Employing the four time-varying features, judges first located the burst and onset of voicing, and then identified place of articulation from the visual displays. Over all conditions, place of articulation was identified at an 86% level of accuracy. While these results constitute only a first step towards an automated analysis procedure, they nonetheless indicate that our new time-varying features are appropriate for identifying place of articulation across both voiced and voiceless stops produced by different speakers at different speaking rates.
