The Speaker and Language Recognition Workshop (Odyssey 2018), 2018
DOI: 10.21437/odyssey.2018-39

A Regression Model of Recurrent Deep Neural Networks for Noise Robust Estimation of the Fundamental Frequency Contour of Speech

Abstract: The fundamental frequency (F0) contour of speech is a key representation of speech prosody, used in speech and spoken language analysis tasks such as voice conversion and speech synthesis, as well as in speaker and language identification. This work proposes new methods for estimating the F0 contour of speech using deep neural networks (DNNs) and recurrent neural networks (RNNs), trained by supervised learning on ground-truth F0 contours. The latest prior research addresses this problem f…

Citations: cited by 5 publications (11 citation statements)
References: 25 publications
“…This work is an extension of our recent preliminary study [22]. In that study, we have successfully employed an RNN regression model, which maps spectral sequence directly onto F0 values, to tackle the disadvantage in existing classification approaches mentioned above.…”
Section: Introduction
confidence: 97%
“…This work is an extension of our two recent, preliminary studies [15], [24], in which we introduced two types of RNN regression models for F0 estimation. Our first approach [15] uses the spectral magnitude to predict the (scalar) F0 value directly, while our second model [24] first maps the raw waveform input into a sinusoid (vector) oscillating with F0. F0 can easily be extracted from this sinusoid using standard signal processing operations (here, through autocorrelation).…”
Section: Introduction
confidence: 97%
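
The last step in the statement above, reading F0 off a sinusoid through autocorrelation, can be illustrated with a minimal sketch. This is not the cited authors' implementation: the function name `estimate_f0_autocorr`, the 60-400 Hz search range, the 8 kHz sample rate, and the synthetic test frame are all assumptions chosen for illustration. It uses the biased autocorrelation (dividing by the frame length), which slightly favors shorter lags and so helps avoid picking a multiple of the true period.

```python
import math


def estimate_f0_autocorr(signal, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 in Hz from one frame via the autocorrelation peak.

    Hypothetical helper for illustration; the search range is an assumption.
    """
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]  # remove any DC offset

    # Only lags that correspond to the assumed F0 range are searched.
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(n - 1, int(sample_rate / f0_min))

    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Biased estimate: dividing by n tapers long lags, which
        # discourages choosing 2x or 3x the true period.
        r = sum(x[i] * x[i + lag] for i in range(n - lag)) / n
        if r > best_val:
            best_lag, best_val = lag, r

    # The best lag is the period in samples; convert it to Hz.
    return sample_rate / best_lag


# Synthetic stand-in for a sinusoid oscillating at F0 (assumed values).
sample_rate = 8000
true_f0 = 200.0
frame = [math.sin(2 * math.pi * true_f0 * t / sample_rate) for t in range(400)]

estimated_f0 = estimate_f0_autocorr(frame, sample_rate)
print(f"estimated F0: {estimated_f0:.1f} Hz")
```

The resolution of this estimate is limited to integer lags; a practical system would typically interpolate around the peak and add a voicing decision, neither of which is shown here.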
“…PEFAC applies matched filters and autocorrelation in the log-frequency domain to achieve noise robustness. Nevertheless, the accuracy remains unsatisfactory in severe noise conditions, such as signal-to-noise ratios (SNRs) less than or equal to 0 dB [13], [15].…”
Section: Introduction
confidence: 99%