Waveform to Single Sinusoid Regression to Estimate the F0 Contour from Noisy Speech Using Recurrent Deep Neural Networks

Kato, Akihiro; Kinnunen, Tomi

doi:10.21437/interspeech.2018-1671

Cited by 8 publications

(17 citation statements)

References 26 publications

“…The counter-intuitive behavior of the FPE curve for GteAug and SRH can be explained by the increasing number of voicing errors: as more low-energy frames, from which F0 is generally harder to detect, are classified as unvoiced when SNR decreases, the number of frames from which FPE is computed decreases. These results also compare favorably against the DNN-based results recently reported on a subset of the same PTDB-TUG corpus [10]. However, direct comparison is difficult, as the authors of [10] performed cropping of silence regions in the signals before SNR calculation.…”

Section: Results (supporting)

confidence: 62%

“…However, direct comparison is difficult, as the authors of [10] performed cropping of silence regions in the signals before SNR calculation. Still, GPE in [10] is always substantially larger than in GteAug across the entire SNR range, while FPE is similar to the now-proposed approach.…”

Section: Results (mentioning)

confidence: 99%
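
The interaction between voicing errors and FPE described in the statements above can be illustrated with a minimal sketch. It assumes one common convention that the quoted papers may not follow exactly: unvoiced frames are marked with F0 = 0, a gross error is a deviation of more than 20% from the reference, GPE is the fraction of gross errors among frames voiced in both tracks, and FPE is the mean absolute error in Hz over the remaining frames. The function name `gpe_fpe` and the `threshold` parameter are hypothetical.

```python
def gpe_fpe(ref_f0, est_f0, threshold=0.2):
    """Return (GPE, FPE in Hz, number of frames used for FPE).

    Assumed convention: F0 == 0 marks an unvoiced frame; only frames that
    are voiced in BOTH the reference and the estimate are scored.
    """
    both_voiced = [(r, e) for r, e in zip(ref_f0, est_f0) if r > 0 and e > 0]
    if not both_voiced:
        return None, None, 0

    # Fine errors: deviations within the gross-error threshold.
    fine_errors = [abs(e - r) for r, e in both_voiced
                   if abs(e - r) <= threshold * r]

    gpe = 1.0 - len(fine_errors) / len(both_voiced)
    fpe = sum(fine_errors) / len(fine_errors) if fine_errors else None
    return gpe, fpe, len(fine_errors)


# Frames 3 and 4 are dropped because one of the two tracks is unvoiced there.
ref = [100.0, 100.0, 0.0, 200.0, 200.0]
est = [102.0, 150.0, 100.0, 0.0, 190.0]
gpe, fpe, n_frames = gpe_fpe(ref, est)
```

Under this convention, every frame the estimator labels unvoiced leaves the scored set, so FPE is averaged over fewer (and typically easier) frames as voicing errors increase.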

“…Another strategy is to utilize parallel recordings of speech and electroglottography (EGG) [10]. This approach, however, is limited by difficulties in recording large amounts of such parallel data, and by the fact that EGG might not give accurate F0 estimates due to, for example, imperfect attachment of the EGG electrodes or inaccuracies of automated estimation algorithms [3].…”

Section: Ground Truth Enhancement (mentioning)

confidence: 99%

“…In recent years with the spread of deep learning, neural estimation of F0 has also been explored. For example, CREPE [9] has produced state-of-the-art results in generic audio pitch tracking, and single sinusoid regression [10] has improved the state-of-the-art F0 estimation performance in noisy conditions. The applicability of neural networks for noise-robust F0 estimation is easy to understand: As the neural network is trained for a regression or classification task from a signal-level input to a known target F0 output, during training the input can be masked, for example, by additive noise which makes the model learn to handle noisy inputs.…”

Section: Introduction (mentioning)

confidence: 99%
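
The additive-noise masking mentioned in the statement above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any of the cited papers: the helper names `mean_power` and `add_noise_at_snr` are hypothetical, signals are plain lists of samples, and SNR is computed over the whole signal without cropping silence regions (a choice that, as noted in an earlier statement, affects comparability between studies).

```python
import math
import random


def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` and add it to `speech` so the mix has the given SNR (dB).

    Assumption: power is measured over the full signal, silence included.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must have non-zero power")
    # Choose gain so that p_speech / (gain**2 * p_noise) == 10**(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Example: a 120 Hz tone at 16 kHz mixed with white Gaussian noise at 5 dB SNR.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 120 * t / 16000) for t in range(16000)]
white = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = add_noise_at_snr(clean, white, 5.0)
```

During training, the clean target F0 stays unchanged while the network input is replaced by a mix such as `noisy`, typically drawn at a range of SNRs.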

Data Augmentation Strategies for Neural Network F0 Estimation

Airaksinen, Juvela, Alku, et al. 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This study explores various speech data augmentation methods for the task of noise-robust fundamental frequency (F0) estimation with neural networks. The explored augmentation strategies are split into additive noise and channel-based augmentation and into vocoder-based augmentation methods. In vocoder-based augmentation, a glottal vocoder is used to enhance the accuracy of ground truth F0 used for training of the neural network, as well as to expand the training data diversity in terms of F0 patterns and vocal tract lengths of the talkers. Evaluations on the PTDB-TUG corpus indicate that noise and channel augmentation can be used to greatly increase the noise robustness of trained models, and that vocoder-based ground truth enhancement further increases model performance. For smaller datasets, vocoder-based diversity augmentation can also be used to increase performance. The best-performing proposed method greatly outperformed the compared F0 estimation methods in terms of noise robustness.

“…In [21], the authors used a CNN to optimize both a classification and regression cost, where a GCI is simultaneously detected and localized in a frame. Other recent related works used regression-based approaches with neural networks for f0 [22] or glottal source parameters estimation (including GCI) [23]. However, those approaches all rely on EGG signals for establishing the ground truth reference used for training the networks.…”

Section: Introduction (mentioning)

confidence: 99%

GCI Detection from Raw Speech Using a Fully-Convolutional Network

Ardaillon, Röebel 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Glottal Closure Instants (GCI) detection consists in automatically detecting temporal locations of most significant excitation of the vocal tract from the speech signal. It is used in many speech analysis and processing applications, and various algorithms have been proposed for this purpose. Recently, new approaches using convolutional neural networks have emerged, with encouraging results. Following this trend, we propose a simple approach that performs a regression from the speech waveform to a target signal from which the GCI are easily obtained by peak-picking. However, the ground truth GCI used for training and evaluation are usually extracted from EGG signals, which are not reliable and often not available. To overcome this problem, we propose to train our network on high-quality synthetic speech with perfect ground truth. The performances of the proposed algorithm are compared with three other state-of-the-art approaches using publicly available datasets, and the impact of using controlled synthetic or real speech signals in the training stage is investigated. The experimental results demonstrate that the proposed method obtains similar or better results than other state-of-the-art algorithms and that using large synthetic datasets with many speakers offers better generalization ability than using a smaller database of real speech and EGG signals.
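
The peak-picking step mentioned in the abstract can be sketched in a few lines. This is a generic illustration, not the authors' implementation: the function name `pick_peaks`, the amplitude `threshold`, and the `min_distance` rule (keep the larger of two peaks that are too close together) are all assumptions, and the input is taken to be a network output in which GCIs appear as positive peaks.

```python
def pick_peaks(signal, threshold, min_distance):
    """Return indices of local maxima that reach `threshold`.

    Peaks closer than `min_distance` samples to the previously kept peak
    are merged by keeping whichever of the two is larger (assumed rule).
    """
    peaks = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
        if signal[i] >= threshold and is_local_max:
            if peaks and i - peaks[-1] < min_distance:
                if signal[i] > signal[peaks[-1]]:
                    peaks[-1] = i
            else:
                peaks.append(i)
    return peaks


# Hypothetical network output with three clear peaks.
target = [0.0, 0.1, 0.9, 0.2, 0.0, 0.3, 1.0, 0.4, 0.0, 0.0, 0.8, 0.85, 0.1]
gci_indices = pick_peaks(target, threshold=0.5, min_distance=3)
```

Dividing each index by the sampling rate gives the estimated GCI times in seconds.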