2020
DOI: 10.1109/taslp.2019.2956145
Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis

Abstract: Neural waveform models have demonstrated better performance than conventional vocoders for statistical parametric speech synthesis. One of the best models, called WaveNet, uses an autoregressive (AR) approach to model the distribution of waveform sampling points, but it has to generate a waveform in a time-consuming sequential manner. Some new models that use inverse-autoregressive flow (IAF) can generate a whole waveform in a one-shot manner but require either a larger amount of training time or a complicated…
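The source-filter idea the paper builds on can be illustrated with a toy sketch: a sine-based excitation signal driven by frame-level F0, passed through a filter. This is only an illustrative assumption of the general scheme, not the paper's implementation; in an actual NSF model the filter is a learned neural network (dilated convolutions), whereas `toy_filter` below is a fixed FIR stand-in, and the names `sine_source`/`toy_filter` and all constants are hypothetical.

```python
import numpy as np

def sine_source(f0, sr=16000, hop=80):
    """Toy source module: sine excitation in voiced frames, noise in unvoiced."""
    # Upsample frame-level F0 to sample level, then integrate it into a phase.
    f0_samp = np.repeat(f0, hop)
    phase = 2 * np.pi * np.cumsum(f0_samp) / sr
    voiced = f0_samp > 0
    noise = 0.003 * np.random.randn(len(f0_samp))
    return np.where(voiced, 0.1 * np.sin(phase) + noise, noise)

def toy_filter(excitation, taps=(0.25, 0.5, 0.25)):
    """Stand-in for the learned neural filter: a fixed 3-tap FIR smoother."""
    return np.convolve(excitation, np.asarray(taps), mode="same")

f0 = np.array([220.0] * 10 + [0.0] * 5)  # 10 voiced frames, 5 unvoiced
wav = toy_filter(sine_source(f0))        # 15 frames x 80 samples = 1200 samples
```

Because the excitation is an explicit function of F0, pitch is directly controllable at the input, which is the property the abstract contrasts against purely autoregressive waveform models.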

Cited by 99 publications (89 citation statements)
References 47 publications
“…Motivated by the development of deep NNs, NN-based excitation generation models with the AR mechanism, such as LPCNet [11], and with the non-AR mechanism, such as GlotGAN [17], [18] and GELP [19], have been proposed to improve the generated speech quality. Moreover, the authors of [31] and [32] also proposed a neural source-filter (NSF) network to model the source-filter generative framework with an advanced neural filter.…”
Section: A. Source-filter and Data-driven Vocoders
confidence: 99%
“…For instance, without explicitly modeling the excitation signals as conventional source-filter models do, it is difficult for WN to generate speech with accurate pitches outside the fundamental frequency (F0) range of the training data when conditioned on the scaled F0 feature [33], [34]. However, using carefully designed mixed periodic and aperiodic inputs and source-filter-like architectures, the authors of [30]–[32] proposed different NN-based models attaining pitch controllability. In our previous works [33], [34], we also proposed a quasi-periodic WN (QPNet), which has a conventional-vocoding-like framework while using a unified network without the requirement of specific mixed inputs.…”
confidence: 99%
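The pitch controllability discussed in the citation above follows from the excitation being an explicit function of F0. A self-contained check (plain NumPy, illustrative only, not code from any cited model; the name `peak_hz` and the FFT sizes are assumptions) shows that scaling the input F0 scales the dominant frequency of a sine excitation proportionally:

```python
import numpy as np

SR = 16000   # sample rate (Hz)
N = 4096     # analysis window length

def peak_hz(f0):
    """Dominant spectral frequency of a sine excitation at the given F0."""
    t = np.arange(N) / SR
    x = np.sin(2 * np.pi * f0 * t)
    # Window the signal and locate the magnitude-spectrum peak.
    spec = np.abs(np.fft.rfft(x * np.hanning(N)))
    return np.argmax(spec) * SR / N
```

Doubling F0 (e.g., 220 Hz to 440 Hz) roughly doubles the measured peak, up to FFT bin resolution, which is the behavior source-filter-style models exploit for pitch control.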
“…the Mel-frequency cepstral coefficients (MFCCs) are used in human speech analysis; Chung et al., 2016; Chorowski et al., 2019; Tjandra et al., 2019), and/or adopt more anatomically/bio-acoustically realistic articulatory systems for the decoder module (cf. Wang et al., 2020, implemented the source-filter model of vocalization based on an artificial neural network). Such Embodied VAEs would allow constructive investigation of vocal learning beyond mere acoustic analysis.…”
Section: Discussion
confidence: 99%
“…We note that NSF is straightforward to train and fast at generating waveforms. It is reported to be 100 times faster than the WaveNet vocoder while achieving comparable voice quality on a large speech corpus [108].…”
Section: A. Speech Analysis and Reconstruction
confidence: 99%