2019 International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON) 2019
DOI: 10.1109/sibircon48586.2019.8957862

Reducing over-smoothness in speech synthesis using Generative Adversarial Networks

Abstract: Speech synthesis is widely used in many practical applications, and the technology has developed rapidly in recent years. However, one reason synthetic speech sounds unnatural is that it is often over-smoothed. To improve the naturalness of synthetic speech, we first extract the mel-spectrogram of the speech and convert it into a real image, then take the over-smoothed mel-spectrogram image as input and use an image-to-image translation Generative Adversarial Networks (GANs) framework to g…

Cited by 4 publications (4 citation statements)
References 17 publications
“…In addition to base Tacotron loss, we use guided attention loss for faster attention convergence. We also use the Structural Similarity Index (SSIM) loss [23] to increase the stability of the training and make Mel-spectrograms less blurry. High-quality vocoder can make the audio quality difference caused by spectral blurring more obvious.…”
Section: The Loss For Acoustic Modeling (mentioning)
confidence: 99%
“…1) Big gap in naturalness between generated speech and realistic speech: the existing method in unconstrained lip-to-speech adopts the MSE criterion in predicting each spectrogram frame. Such design can not capture the correlation among frequency bins in a frame, which leads to over-smoothness in spectrogram (Sheng and Pavlovskiy 2019). 2) High inference latency: the existing method utilizes the autoregressive architecture, generating current frames conditioned on previous ones.…”
Section: Introduction (mentioning)
confidence: 99%
“…Early non-autoregressive TTS models (Ren et al., 2019; Peng et al., 2020) use mean absolute error (MAE) or mean square error (MSE) as loss function to model speech mel-spectrograms, implicitly assuming that data points in mel-spectrograms are independent to each other and follow a unimodal distribution. Consequently, the mel-spectrograms following dependent and multimodal distributions cannot be well modeled by the MAE or MSE loss, which presents great challenges in non-autoregressive TTS modeling and causes over-smoothed (blurred) predictions in mel-spectrograms (Vasquez and Lewis, 2019; Sheng and Pavlovskiy, 2019).…”
Section: Introduction (mentioning)
confidence: 99%
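
The over-smoothing argument in the statements above can be illustrated with a toy example. The sketch below is not taken from any of the cited papers; the two four-bin "frames" and all function names are hypothetical, chosen only to show that when the same input has two equally likely sharp targets, the prediction that minimizes MSE is their per-bin average, which is flatter than either target.

```python
# Toy illustration (hypothetical data): MSE favors a blurred average frame
# when the target distribution is multimodal.

def mse(pred, frames):
    """Mean squared error of one predicted frame against several target frames."""
    total = 0.0
    for frame in frames:
        total += sum((p - t) ** 2 for p, t in zip(pred, frame)) / len(frame)
    return total / len(frames)

def mse_optimal_frame(frames):
    """The single frame minimizing MSE is the per-bin mean of the targets."""
    n = len(frames)
    return [sum(bins) / n for bins in zip(*frames)]

# Two equally likely targets for the same input, each with one sharp peak.
sharp_a = [0.0, 1.0, 0.0, 0.0]
sharp_b = [0.0, 0.0, 1.0, 0.0]
targets = [sharp_a, sharp_b]

blurred = mse_optimal_frame(targets)

print("MSE-optimal frame:", blurred)
print("MSE of blurred frame:", mse(blurred, targets))
print("MSE of sharp_a:", mse(sharp_a, targets))
```

The averaged frame scores a lower MSE than either sharp frame even though it matches neither target, which is the behavior the citing papers describe as over-smoothed or blurred mel-spectrograms.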