ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683271

Waveform Generation for Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks

Abstract: The state-of-the-art in text-to-speech synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parallel versions are difficult to train and even more expensive computationally. Meanwhile, generative adversarial networks (GANs) have achieved impressive results in image generation and are making their way into audio applications; parallel inference is among their lucrativ…

Cited by 18 publications (17 citation statements)
References 22 publications
“…Our previous works [18,19] show that the capacity of the AR vocoder is highly related to the length of the receptive field, and we argue that the No-AR vocoder has a similar tendency. Specifically, the receptive field length of PWG_30 is 6139 (2^0 + … + 2^9 = 1023, with three cycles, two sides, plus one) and that of PWG_20 is 4093.…”
Section: A. Effective Receptive Field (supporting)
confidence: 52%
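The arithmetic quoted above can be checked directly: for a non-causal stack of kernel-3 dilated convolutions with dilations 2^0…2^9, each cycle contributes 1023 samples per side, so three cycles give 2 × 3 × 1023 + 1 = 6139. A minimal sketch (the function name and parameters are illustrative, not from the cited paper):

```python
# Receptive field of a non-causal dilated-conv stack with kernel size 3,
# dilations 2^0 .. 2^(dilations_per_cycle - 1), repeated `cycles` times.
# Each kernel-3 layer with dilation d widens the field by d on each side.

def receptive_field(cycles: int, dilations_per_cycle: int = 10) -> int:
    per_cycle = sum(2 ** i for i in range(dilations_per_cycle))  # 1023 for 10
    return 2 * cycles * per_cycle + 1

print(receptive_field(3))  # PWG_30: 6139
print(receptive_field(2))  # PWG_20: 4093
```

This reproduces both figures in the quote, supporting the "three cycles, two sides, plus one" decomposition.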
“…1, the conventional parametric vocoders generate speech samples in an AR manner, such as LPC vocoders [47], [48] and mel-generalized cepstrum (MGC) vocoders [49], [50], or in a non-AR manner, such as STRAIGHT [4] and WORLD [5]. Motivated by the development of deep NNs, NN-based excitation generation models with the AR mechanism, such as LPCNet [11], and with the non-AR mechanism, such as GlotGAN [17], [18] and GELP [19], have been proposed to improve the quality of the generated speech. Moreover, the authors of [31] and [32] also proposed a neural source-filter (NSF) network to model the source-filter generative framework with an advanced neural filter.…”
Section: A. Source-Filter and Data-Driven Vocoders (mentioning)
confidence: 99%
“…The current study adapts the loss functions from our previous work [22]. However, the losses can now be defined directly in the speech domain, since the LP synthesis filter and overlap-add are integrated into the computation graph.…”
Section: Losses (mentioning)
confidence: 99%
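The key idea in the quote above is that LP (linear prediction) synthesis is an all-pole recursion, so once it is expressed as ordinary tensor operations it can sit inside the training graph and losses can be computed on the filtered speech rather than on the excitation. A minimal NumPy sketch of the recursion itself (illustrative only; the cited work implements a differentiable, frame-wise version with overlap-add):

```python
import numpy as np

def lp_synthesis(excitation: np.ndarray, a: np.ndarray) -> np.ndarray:
    """All-pole LP synthesis: y[n] = e[n] - sum_k a[k] * y[n - k].

    `a` holds the predictor coefficients a[1..p]. This literal loop shows
    the recursion; an autodiff framework would express the same operation
    so gradients flow from speech-domain losses back to the excitation model.
    """
    p = len(a)
    y = np.zeros_like(excitation)
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, min(p, n) + 1):
            acc -= a[k - 1] * y[n - k]
        y[n] = acc
    return y

# Impulse through a one-pole filter with a[1] = -0.5 decays as 1, 0.5, 0.25, ...
e = np.zeros(8)
e[0] = 1.0
print(lp_synthesis(e, np.array([-0.5]))[:3])  # [1.   0.5  0.25]
```

Because every step is a plain arithmetic operation on the signal, the same structure is what lets the quoted work back-propagate speech-domain losses through the synthesis filter.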
“…In this paper, we combine an MFCC-based envelope model [21] with recent GAN training insights [22]. Furthermore, we use neural network architectures that operate directly on raw audio, and integrate a parallel-inference-capable LP synthesis filter into the computation graph.…”
Section: Introduction (mentioning)
confidence: 99%