Decomposing speech signals into periodic and aperiodic components is an important task, finding applications in speech synthesis, coding, denoising, etc. In this paper, we construct a time-frequency coherence function to analyze spectro-temporal signatures of speech signals for distinguishing between deterministic and stochastic components of speech. The narrowband speech spectrogram is segmented into patches, which are represented as 2-D cosine carriers modulated in amplitude and frequency. Separation of carrier and amplitude/frequency modulations is achieved by 2-D demodulation using Riesz transform, which is the 2-D extension of Hilbert transform. The demodulated AM component reflects contributions of the vocal tract to spectrogram. The frequency modulated carrier (FM-carrier) signal exhibits properties of the excitation. The time-frequency coherence is defined with respect to FM-carrier and a coherence map is constructed, in which highly coherent regions represent nearly periodic and deterministic components of speech, whereas the incoherent regions correspond to unstructured components. The coherence map shows a clear distinction between deterministic and stochastic components in speech characterized by jitter, shimmer, lip radiation, type of excitation, etc. Binary masks prepared from the time-frequency coherence function are used for periodic-aperiodic decomposition of speech. Experimental results are presented to validate the efficiency of the proposed method.
We consider a two-dimensional demodulation framework for spectro-temporal analysis of the speech signal. We construct narrowband (NB) speech spectrograms, and demodulate them using the Riesz transform, which is a two-dimensional extension of the Hilbert transform. The demodulation results in timefrequency envelope (amplitude modulation or AM) and timefrequency carrier (frequency modulation or FM). The AM corresponds to the vocal tract and is referred to as the vocal tract spectrogram. The FM corresponds to the underlying excitation and is referred to as the carrier spectrogram. The carrier spectrogram exhibits a high degree of time-frequency consistency for voiced sounds. For unvoiced sounds, such a structure is lacking. In addition, the carrier spectrogram reflects the fundamental frequency (F0) variation of the speech signal. We develop a technique to determine the F0 from the carrier spectrogram. The time-frequency consistency is used to determine which time-frequency regions correspond to voiced segments. Comparisons with the state-of-the-art F0 estimation algorithms show that the proposed F0 estimator has high accuracy for telephone channel speech and is robust to noise.
We address the problem of estimating the time-varying spectral envelope of a speech signal using a spectro-temporal demodulation technique. Unlike the conventional spectrogram, we consider a pitch-adaptive spectrogram and model a spectrotemporal patch using an amplitude-and frequency-modulated two-dimensional (2-D) cosine signal. We employ a demodulation technique based on the Riesz transform that we proposed recently to estimate the amplitude and frequency modulations. The amplitude modulation (AM) corresponds to the vocal-tract filter magnitude response (or envelope) and the frequency modulation (FM) corresponds to the excitation. We consider the AM and demonstrate its effectiveness by incorporating it as an acoustic feature for local conditioning in the statistical WaveNet vocoder for the task of speech synthesis. The quality of the synthesized speech obtained with the Riesz envelope is compared with that obtained using the envelope estimated by the WORLD vocoder. Objective measures and subjective listening tests on the CMU-Arctic database show that the quality of synthesis is superior to that obtained using the WORLD envelope. This study thus establishes the Riesz envelope as an efficient alternative to the WORLD envelope.
In contrast to 1-D short-time analysis of speech, 2-D modeling of spectrograms provides a characterization of speech attributes directly in the joint time-frequency plane. Building on existing 2-D models to analyze a spectrogram patch, we propose a multicomponent 2-D AM-FM representation for spectrogram decomposition. The components of the proposed representation comprise a DC, a fundamental frequency carrier and its harmonics, and a spectrotemporal envelope, all in 2-D. The number of harmonics required is patch-dependent. The estimation of the AM and FM is done using the Riesz transform, and the component weights are estimated using a least-squares approach. The proposed representation provides an improvement over existing state-of-the-art approaches, for both male and female speakers. This is quantified using reconstruction SNR and perceptual evaluation of speech quality (PESQ) metric. Further, we perform an overlap-add on the DC component, pooling all the patches and obtain a time-frequency (t-f) aperiodicity map for the speech signal. We verify its effectiveness in improving speech synthesis quality by using it in an existing state-of-theart vocoder.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.