Interspeech 2018
DOI: 10.21437/interspeech.2018-1671

Waveform to Single Sinusoid Regression to Estimate the F0 Contour from Noisy Speech Using Recurrent Deep Neural Networks

Abstract: The fundamental frequency (F0) represents pitch in speech that determines prosodic characteristics of speech and is needed in various tasks for speech analysis and synthesis. Despite decades of research on this topic, F0 estimation at low signal-to-noise ratios (SNRs) in unexpected noise conditions remains difficult. This work proposes a new approach to noise-robust F0 estimation using a recurrent neural network (RNN) trained in a supervised manner. Recent studies employ deep neural networks (DNNs) for F0 …

Cited by 8 publications (17 citation statements)
References: 26 publications
“…The counter-intuitive behavior of the FPE curve for GteAug and SRH can be explained by the increasing number of voicing errors: as more low-energy frames, from which F0 is generally harder to detect, are classified as unvoiced when SNR decreases, the number of frames from which FPE is computed decreases. These results also compare favorably against the DNN-based results recently reported on a subset of the same PTDB-TUG corpus [10]. However, direct comparison is difficult, as the authors of [10] performed cropping of silence regions in the signals before SNR calculation.…”
Section: Results (supporting)
confidence: 62%
“…However, direct comparison is difficult, as the authors of [10] performed cropping of silence regions in the signals before SNR calculation. Still, GPE in [10] is always substantially larger than in GteAug across the entire SNR range, while FPE is similar to the now-proposed approach.…”
Section: Results (mentioning)
confidence: 99%
“…In [21], the authors used a CNN to optimize both a classification and regression cost, where a GCI is simultaneously detected and localized in a frame. Other recent related works used regression-based approaches with neural networks for f0 [22] or glottal source parameters estimation (including GCI) [23]. However, those approaches all rely on EGG signals for establishing the ground truth reference used for training the networks.…”
Section: Introduction (mentioning)
confidence: 99%