Interspeech 2018
DOI: 10.21437/interspeech.2018-1757

A New Glottal Neural Vocoder for Speech Synthesis

Abstract: Direct modeling of waveform generation for speech synthesis, e.g. WaveNet, has made significant progress in improving the naturalness and clarity of TTS. Such deep neural network-based models can generate highly realistic speech, but at high computational and memory cost. We propose here a novel neural glottal vocoder that aims to bridge the gap between traditional parametric vocoders and end-to-end speech sample generation. In the analysis, speech signals are decomposed into corresponding glottal source …
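
The abstract is truncated mid-sentence, but the analysis step it describes — decomposing speech into a glottal source and, implicitly, a vocal-tract filter — is conventionally done with glottal inverse filtering. Below is a minimal sketch of that generic idea using LPC; the frame length, LPC order, and the use of librosa.lpc are illustrative assumptions, not the paper's actual method.

```python
import numpy as np
import librosa

def glottal_decomposition(frame, lpc_order=24):
    """Return (vocal-tract LPC coefficients, glottal source estimate)."""
    # Fit an all-pole vocal-tract model 1/A(z) to the frame.
    a = librosa.lpc(frame, order=lpc_order)
    # Inverse filtering: running the speech through A(z) cancels the
    # vocal-tract resonances, leaving an estimate of the glottal excitation.
    glottal = np.convolve(frame, a, mode="same")
    return a, glottal

# Toy input: a slightly noisy impulse train as a crude stand-in for voiced speech.
sr, f0 = 16000, 120
x = np.zeros(512)
x[:: sr // f0] = 1.0
x += 0.01 * np.random.randn(x.size)
frame = x * np.hanning(x.size)
a, glottal_src = glottal_decomposition(frame)
```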

Cited by 15 publications (9 citation statements) · References 12 publications

Citation statements, ordered by relevance:

“…We use REAPER for GCI and pitch estimation [31]. Similar phase-locked representations have been successfully applied not only in our previous work [30, 24, 25], but also in [23, 32]. Furthermore, using GCIs to center waveforms in a window can be seen as analogous to using facial landmarks to center images, as done in the highly successful CelebA-HQ dataset [10].…”
Section: Waveform Representation (mentioning)
confidence: 99%
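
As a rough illustration of the phase-locked representation this statement describes, the sketch below extracts fixed-length windows centered on glottal closure instants (GCIs). In practice the GCI positions would come from a tool like REAPER; here the gci_samples array and the 400-sample window length are assumed placeholders.

```python
import numpy as np

def gci_centered_windows(signal, gci_samples, win_len=400):
    """Stack fixed-length windows, each centered on one GCI."""
    half = win_len // 2
    windows = [signal[g - half:g + half]
               for g in gci_samples
               if half <= g < len(signal) - half]   # skip GCIs near the edges
    return np.stack(windows)                        # (num_gcis, win_len)

sig = np.random.randn(16000)              # stand-in for one second of speech
gcis = np.arange(500, 15500, 130)         # stand-in GCI sample positions
print(gci_centered_windows(sig, gcis).shape)   # (116, 400)
```
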
“…Early approaches used a point-wise least squares loss in the time domain [20,21], and while this captures the gross waveshape well, the produced output is essentially a conditional average and lacks the stochastic high-frequency content (due to the averaging). The missing stochastic component can be recreated using signal processing techniques for aperiodicity modification, resulting in high-quality synthetic speech [22,23], but this involves making signal model assumptions that may not hold in general. Further efforts have been made to model the stochastic part directly using GANs [24] or WaveNet ("GlotNet") [21].…”
Section: Introduction (mentioning)
confidence: 99%
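
The claim that a point-wise least squares loss yields a conditional average, washing out the stochastic high-frequency component, can be checked numerically: across many noisy realizations of the same waveform, the L2-optimal point prediction is their mean. The toy signal and noise level below are arbitrary.

```python
import numpy as np

t = np.arange(400) / 16000.0
clean = np.sin(2 * np.pi * 120 * t)                  # deterministic component
noisy = clean + 0.3 * np.random.randn(100, t.size)   # 100 noisy realizations

l2_optimal = noisy.mean(axis=0)    # the minimizer of mean squared error
print(np.std(noisy[0] - clean))    # ~0.3: each realization keeps its noise
print(np.std(l2_optimal - clean))  # ~0.03: the average has lost the noise
```
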
“…For the separate frame-rate conditioning model, the parameter count is similar to the WaveNets, but the FLOPS count is significantly lower due to operating at a 200 Hz rate (as opposed to 16 kHz). For comparison, [19] reported their WaveNet running at 209G FLOPS, although the paper lacks information on the exact model configuration and how the FLOPS count was estimated. For further comparison, our estimation method gives 1008G FLOPS for the model configuration proposed in [16] (30 dilation layers, 256 residual channels, 2048 skip channels).…”
Section: Computational Complexity (mentioning)
confidence: 99%
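
Since the cited papers do not all spell out how their FLOPS counts were estimated, one plausible convention is sketched below: count multiply-adds in the dilated, residual, and skip convolutions of each layer and multiply by the sample rate, with 2 FLOPs per multiply-add. With the configuration from [16] (30 dilation layers, 256 residual channels, 2048 skip channels) this lands in the same ballpark as the 1008G figure; the exact number depends on which input, output, and post-processing layers are included, so treat this as an estimate rather than a reproduction of any quoted figure.

```python
def wavenet_gflops(layers=30, residual=256, skip=2048,
                   kernel=2, sample_rate=16000):
    # Dilated conv produces filter + gate, i.e. 2*residual output channels.
    dilated = kernel * residual * 2 * residual
    res_1x1 = residual * residual          # residual-connection projection
    skip_1x1 = residual * skip             # skip-connection projection
    macs_per_sample = layers * (dilated + res_1x1 + skip_1x1)
    return 2 * macs_per_sample * sample_rate / 1e9   # 1 MAC = 2 FLOPs

print(f"{wavenet_gflops():.0f} GFLOPS")   # ~818 GFLOPS before output layers
```
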
“…Additionally, acoustic conditioning enables the same models to be used in other (non-TTS) waveform generation tasks, such as voice conversion [18]. Despite their recent success, WaveNet models suffer from their need for substantial amounts of training data and large model sizes, making them expensive to train and use [19]. Furthermore, the autoregressive nature of WaveNets makes inference inherently slow.…”
Section: Introduction (mentioning)
confidence: 99%