2018
DOI: 10.1109/taslp.2017.2761547

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

Abstract: A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural network (DNN) techniques can be applied to artificially synthesize speech waveforms, the synthetic speech quality is low compared with that of natural speech. One of the issues causing the quality degradation is an over-smoothing effect often observed in the generated speech parameters. A GAN introduced in this paper consists of two neural networks: a discriminat…

Cited by 199 publications
(89 citation statements)
References 41 publications
“…The V2S attack follows the former and tries to reproduce the targeted speaker's voice from the ASV model. A similar idea was used in Saito et al's work [6] that incorporated a voice anti-spoofing (i.e., a discriminative model to detect spoofing attacks) into training of a VC model for reproducing fine structures of the synthesized voice.…”
Section: Discussion; citation type: mentioning
confidence: 99%
“…SS is now able to generate high-quality voice due to recent advances in unit selection [45], statistical parametric [46], hybrid [47], and DNN-based TTS methods. Recently, deep learning-based techniques, such as Generative Adversarial Network (GAN) [48], Tacotron [49], Wavenet [50], etc., are able to produce very natural sounding speech both in timbre and prosody. SS uses properties of a claimed speaker's voice characteristics and spectral cues of the natural speech.…”
Section: B) Synthetic Speech; citation type: mentioning
confidence: 99%
“…In addition to the inverter, we also have a discriminator module. The discriminator predicts whether the given spectrogram is real data or is generated by the inverter, which generates a realistic spectrogram to deceive the discriminator [14,15,16]. The Code2Spec inverter has several training objectives:…”
Section: Vector Quantized Variational Autoencoder (VQ-VAE); citation type: mentioning
confidence: 99%