Abstract-We present the Dynamic Programming Projected Phase-Slope Algorithm (DYPSA) for automatic estimation of glottal closure instants (GCIs) in voiced speech. Accurate estimation of GCIs is an important tool that can be applied to a wide range of speech processing tasks including speech analysis, synthesis and coding. DYPSA is automatic and operates using the speech signal alone without the need for an EGG signal. The algorithm employs the phase-slope function and a novel phase-slope projection technique for estimating GCI candidates from the speech signal. The most likely candidates are then selected using a dynamic programming technique to minimize a cost function that we define. We review and evaluate three existing methods of GCI estimation and compare the new DYPSA algorithm to them. Results are presented for the APLAWD and SAM databases for which 95.7% and 93.1% of GCIs are correctly identified.
The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires however that the precise locations of the Glottal Closure Instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers. The five techniques compared are the Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Meanbased Signal (SEDREAMS) and the Yet Another GCI Algorithm (YAGA). The efficacy of these methods is first evaluated on clean speech, both in terms of reliabililty and accuracy. Their robustness to additive noise and to reverberation is also assessed. A further contribution of the paper is the evaluation of their performance on a concrete application of speech processing: the causal-anticausal decomposition of speech. It is shown that for clean speech, SEDREAMS and YAGA are the best performing techniques, both in terms of identification rate and accuracy. ZFR and SEDREAMS also show a superior robustness to additive noise and reverberation.
Abstract-Accurate estimation of glottal closing instants (GCIs) and opening instants (GOIs) is important for speech processing applications that benefit from glottal-synchronous processing including pitch tracking, prosodic speech modification, speech dereverberation, synthesis and study of pathological voice. We propose the Yet Another GCI/GOI Algorithm (YAGA) to detect GCIs from speech signals by employing multiscale analysis, the group delay function, and -best dynamic programming. A novel GOI detector based upon the consistency of the candidates' closed quotients relative to the estimated GCIs is also presented. Particular attention is paid to the precise definition of the glottal closed phase, which we define as the analysis interval that produces minimum deviation from an all-pole model of the speech signal with closedphase linear prediction (LP). A reference algorithm analyzing both electroglottograph (EGG) and speech signals is described for evaluation of the proposed speech-based algorithm. In addition to the development of a GCI/GOI detector, an important outcome of this work is in demonstrating that GOIs derived from the EGG signal are not necessarily well-suited to closed-phase LP analysis. Evaluation of YAGA against the APLAWD and SAM databases show that GCI identification rates of up to 99.3% can be achieved with an accuracy of 0.3 ms and GOI detection can be achieved equally reliably with an accuracy of 0.5 ms.Index Terms-Dynamic programming, electroglottograph (EGG), glottal closing instants (GCIs), glottal opening instants (GOIs), group delay function, multiscale analysis, speech processing.
Abstract-Measures based on the group delay of the LPC residual have been used by a number of authors to identify the time instants of glottal closure in voiced speech. In this paper, we discuss the theoretical properties of three such measures and we also present a new measure having useful properties. We give a quantitative assessment of each measure's ability to detect glottal closure instants evaluated using a speech database that includes a direct measurement of glottal activity from a Laryngograph/EGG signal. We find that when using a fixed-length analysis window, the best measures can detect the instant of glottal closure in 97% of larynx cycles with a standard deviation of 0.6 ms and that in 9% of these cycles an additional excitation instant is found that normally corresponds to glottal opening. We show that some improvement in detection rate may be obtained if the analysis window length is adapted to the speech pitch. If the measures are applied to the preemphasized speech instead of to the LPC residual, we find that the timing accuracy worsens but the detection rate improves slightly. We assess the computational cost of evaluating the measures and we present new recursive algorithms that give a substantial reduction in computation in all cases.
We propose a novel feature set for speaker recognition that is based on the voice source signal. The feature extraction process uses closed-phase LPC analysis to estimate the vocal tract transfer function. The LPC spectrum envelope is converted to cepstrum coefficients which are used to derive the voice source features. Unlike approaches based on inverse-filtering, our procedure is robust to LPC analysis errors and low-frequency phase distortion. We have performed text-independent closed-set speaker identification experiments on the TIMIT and the YOHO databases using a standard Gaussian mixture model technique. Compared to using melfrequency cepstrum coefficients, the misclassification rate for the TIMIT database reduced from 1.51% to 0.16% when combined with the proposed voice source features. For the YOHO database the misclassification rate decreased from 13.79% to 10.07%. The new feature vector also compares favourably to other proposed voice source feature sets.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.