A F0 and voicing status estimation algorithm for high quality speech analysis/synthesis is proposed. This problem is approached from a different perspective that models the behavior of feature extractors under noise, instead of directly modeling speech signals. Under time-frequency locality assumptions, the joint distribution of extracted features and target F0 can be characterized by training a bank of Gaussian mixture models (GMM) on artificial data generated from Monte-Carlo simulations. The trained GMMs can then be used to generate a set of conditional distributions on the predicted F0, which are then combined and post-processed by Viterbi algorithm to give a final F0 trajectory. Evaluation on CSTR and CMU Arctic speech databases shows that the proposed method, trained on fully synthetic data, achieves lower gross error rates than state-of-the-art methods.
We present improvements to the refinement stage of YANGsaf[1] (Yet ANother Glottal source analysis framework), a recently published F0 estimation algorithm by Kawahara et al., for noisy/breathy speech signals. The baseline system, based on time-warping and weighted average of multi-band instantaneous frequency estimates, is still sensitive to additive noise when none of the harmonic provide reliable frequency estimate at low SNR. We alleviate this problem by calibrating the weighted averaging process based on statistics gathered from a Monte-Carlo simulation, and applying Kalman filtering to refined F0 trajectory with time-varying measurement and process distributions. The improved algorithm, adYANGsaf (adaptive Yet ANother Glottal source analysis framework), achieves significantly higher accuracy and smoother F0 trajectory on noisy speech while retaining its accuracy on clean speech, with little computational overhead introduced.
This study focuses on generating fundamental frequency (F0) curves of singing voice from musical scores stored in a midilike notation. Current statistical parametric approaches to singing F0 modeling meet difficulties in reproducing vibratos and the temporal details at note boundaries due to the oversmoothing tendency of statistical models. This paper presents a neural network based solution that models a pair of neighboring notes at a time (the transition model) and uses a separate network for generating vibratos (the sustain model). Predictions from the two models are combined by summation after proper enveloping to enforce continuity. In the training phase, mild misalignment between the scores and the target F0 is addressed by back-propagating the gradients to the networks' inputs. Subjective listening tests on the NITech singing database show that transition-sustain models are able to generate F0 trajectories close to the original performance.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.