Interspeech 2018
DOI: 10.21437/interspeech.2018-1258

Nebula: F0 Estimation and Voicing Detection by Modeling the Statistical Properties of Feature Extractors

Abstract: An F0 and voicing status estimation algorithm for high-quality speech analysis/synthesis is proposed. The problem is approached from a different perspective: modeling the behavior of feature extractors under noise instead of directly modeling speech signals. Under time-frequency locality assumptions, the joint distribution of extracted features and target F0 can be characterized by training a bank of Gaussian mixture models (GMMs) on artificial data generated from Monte Carlo simulations. The trained GMMs ca…

Cited by 2 publications (2 citation statements)
References 8 publications

“…The 1D convolution operation within the module is performed with varying levels of dilations over the frame time axis to model the time structure of F0 contours. In this study, the used dilations for the 8-layer residual module stack are d = [1,2,4,8,1,2,4,8], yielding (including the postnet) a receptive field of 71 frames with the selected filter length of 5 (current frame plus 35 past frames and 35 future frames) that condition each F0 estimate. The output of the convolution is passed through tanh activation and multiplied by a gating activation produced by a similar operation with the logistic sigmoid activation function.…”
Section: Neural Network F0 Estimation Methods (mentioning)
confidence: 99%
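
The gated, dilated convolution described in the citation statement above can be sketched roughly as follows. This is a minimal single-channel illustration and not the citing paper's implementation: the function names `dilated_conv1d` and `gated_unit`, the zero padding, and the example weights are all assumptions, and the residual connections, channel dimensions, and postnet are omitted.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Non-causal 'same' 1D convolution over the frame axis (odd filter length assumed).

    Each output frame sees the current frame plus equally many past and
    future frames, spaced `dilation` frames apart.
    """
    k = len(w)
    half = (k - 1) // 2 * dilation
    xp = np.pad(x, half)  # zero-pad both ends (padding scheme is an assumption)
    return np.array([
        sum(w[j] * xp[t + j * dilation] for j in range(k))
        for t in range(len(x))
    ])

def gated_unit(x, w_filter, w_gate, dilation):
    """tanh branch multiplied by a logistic-sigmoid gate from a similar convolution."""
    f = np.tanh(dilated_conv1d(x, w_filter, dilation))
    g = 1.0 / (1.0 + np.exp(-dilated_conv1d(x, w_gate, dilation)))
    return f * g

# Illustrative input: a single impulse at frame 10, filter length 5, dilation 2.
x = np.zeros(20)
x[10] = 1.0
w = np.ones(5)  # hypothetical weights, for illustration only
out = gated_unit(x, w, w, dilation=2)
```

With a filter length of 5 and a dilation of 2, the impulse influences only frames 6, 8, 10, 12, and 14, which shows how larger dilations widen the temporal context without adding filter taps.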
“…Frequency domain methods utilize, for example, the energy of the linear prediction residual harmonics (e.g., SRH [4]) or instantaneous frequency (e.g., TEMPO [5]). In addition to the methods that provide raw frame-level estimates of F0, multiple methods have been developed for candidate F0 selection and/or postprocessing of the raw F0 estimates for improved robustness (e.g., pYIN [6], YAAPT [7], and Nebula [8]).…”
Section: Introduction (mentioning)
confidence: 99%