Estimation of pitch from a given segment of speech plays an important role in various speech processing applications, such as speech coding, speech recognition, speaker recognition, and speech synthesis. Even though there are several efficient algorithms, estimating the pitch frequency from speech signals that are severely degraded by noise is still a challenging task. In this paper, we propose a robust framework for pitch estimation using the harmonic product spectrum (HPS) derived from the discrete cosine transform (DCT) of the signal. This novel method exploits the better decorrelating nature of the DCT, which makes the pitch harmonics appear sharper in its spectrum. Potentially, this facilitates accurate pitch estimation at a lower order of the harmonic product spectrum than DFT-based HPS requires. A systematic evaluation is carried out to compare the performance of the proposed method with some of the successful algorithms, such as DFT-based HPS, SIFT, and the cepstrum-based technique. The results clearly show that the proposed algorithm outperforms the other algorithms for speech signals that are severely corrupted by noise (low SNR). The effectiveness of this method for different durations of the analysis window, various orders of HPS, and the refinements are also discussed.
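As an illustration of the idea, a DCT-based HPS estimator might be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `dct_hps_pitch`, the Hamming window, the 50 to 500 Hz search range, and the default order of 3 are all assumptions made for the example, and the refinements mentioned in the abstract are not reproduced.

```python
import numpy as np
from scipy.fft import dct


def dct_hps_pitch(frame, fs, order=3, fmin=50.0, fmax=500.0):
    """Estimate pitch (Hz) of one frame via an HPS built on the DCT magnitude.

    Hypothetical helper: the window, search range and order are assumptions.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)

    # Taper the frame, then take the magnitude of its DCT-II.
    spectrum = np.abs(dct(frame * np.hamming(n), type=2, norm="ortho"))

    # DCT-II bin k corresponds to the frequency k * fs / (2 * n),
    # i.e. twice the frequency resolution of an n-point DFT.
    limit = n // order  # highest bin for which all decimated copies exist

    # Harmonic product spectrum: multiply the spectrum by copies of itself
    # decimated by 2, 3, ..., order, so that harmonics align at the pitch bin.
    hps = spectrum[:limit].copy()
    for r in range(2, order + 1):
        hps *= spectrum[::r][:limit]

    # Restrict the peak search to a plausible pitch range (assumed here).
    kmin = max(1, int(np.ceil(fmin * 2 * n / fs)))
    kmax = min(limit - 1, int(np.floor(fmax * 2 * n / fs)))
    k0 = kmin + int(np.argmax(hps[kmin:kmax + 1]))
    return k0 * fs / (2 * n)
```

For a 50 ms frame sampled at 8 kHz (`n = 400`), the bin spacing in this sketch is 10 Hz, so the estimate is quantized to that grid unless the peak is interpolated.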
Voiced speech is produced by excitation of the vocal tract system with the quasiperiodic vibrations of the vocal folds at the glottis. These excitations become significantly stronger when the vocal folds are fully opened or about to be closed. In this work, the focus is on estimating these instants of significant excitation using the temporal phase periodicity present in the speech signal. Modeling the quasiperiodic vibrations of the vocal folds as a slowly varying sinusoid, the phase of this signal is computed from the phase of the first frequency component of the discrete Fourier transform. At the peaks of the speech signal, i.e., at the locations of significant instants, the phase of this component is expected to be zero. The temporal phase function is evaluated by moving the analysis window sample by sample, and the instants at which this phase function crosses zero are taken as the significant instants in the speech signal. To analyze the performance of this technique, 30 seconds of speech data from the TIMIT speech corpus, uttered by both male and female speakers, is considered. The estimated instants are compared with manually marked instants of significant excitation, and the results are found to be promising. The effectiveness of this technique for different durations of the analysis window is also discussed.
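The sliding-window phase computation described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `significant_instants` is hypothetical, the window is taken to start at the reported sample (left-aligned), its length is assumed to be close to one pitch period, and only negative-to-positive zero crossings are kept so that the wrap from +π to −π is not counted.

```python
import numpy as np


def significant_instants(x, win_len):
    """Return sample indices where the phase of DFT bin 1 crosses zero upward.

    Hypothetical helper: window alignment and crossing rule are assumptions.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(win_len)

    # Basis of the first DFT component for a window of length win_len.
    kernel = np.exp(-2j * np.pi * n / win_len)

    # Convolving with the reversed kernel gives, for every window start m,
    # X1[m] = sum_n x[m + n] * exp(-2j * pi * n / win_len),
    # i.e. DFT bin 1 of the window moved sample by sample.
    x1 = np.convolve(x, kernel[::-1], mode="valid")
    phase = np.angle(x1)

    # Keep upward zero crossings; the jump limit rejects spurious
    # crossings caused by flicker between -pi and +pi at the wrap point.
    prev, curr = phase[:-1], phase[1:]
    rising = (prev < 0) & (curr >= 0) & (curr - prev < np.pi)
    return np.flatnonzero(rising) + 1
```

For a pure cosine whose period equals `win_len`, the phase in this sketch grows linearly with the window position and passes through zero when the window starts at a peak of the cosine, which is the behavior the abstract relies on.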