2019
DOI: 10.48550/arxiv.1912.01167
Preprint

High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram

Abstract: In speech synthesis and speech enhancement systems, mel-spectrograms need to be precise acoustic representations. However, the generated spectrograms are often over-smooth and cannot produce high-quality synthesized speech. Inspired by image-to-image translation, we address this problem with a learning-based post filter combining Pix2PixHD and ResUnet to reconstruct mel-spectrograms with super-resolution. From the resulting super-resolution spectrogram networks, we can generate enhanced spect…
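The abstract treats the mel-spectrogram as the acoustic representation to be reconstructed. As background, here is a minimal numpy sketch of how a log-mel-spectrogram is computed from a waveform; the function names and parameter defaults are illustrative, not taken from the paper:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=40):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(x, sr=16000, n_fft=512, hop=128, n_mels=40):
    # Frame and window the signal, take the power spectrum,
    # then project each frame onto the mel filters and take the log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    return np.log(mel_filterbank(sr, n_fft, n_mels) @ power.T + 1e-10)
```

The paper's post filter then operates on this (n_mels, n_frames) array as if it were a single-channel image, which is what makes image-to-image architectures such as Pix2PixHD applicable.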

Cited by 4 publications
(4 citation statements)
References 27 publications
“…Inspired by the human hearing perception, the MFCC has emerged as a widely accepted method in acoustic signal processing. However, the discrete cosine transform (DCT) in the MFCC approach may filter out a substantial amount of valuable data [39]. Deep learning has a powerful ability to extract features, and researchers prefer to extract features from the Mel spectrogram which contains more information.…”
Section: Feature Preparing
confidence: 99%
“…The authors in [13] introduce an end-to-end GAN-based system for speech bandwidth extension for use in downstream automatic speech recognition. The authors in [14] suggest using a WaveNet model to directly output a high-sample-rate speech signal, while the authors in [15] suggest using GANs to estimate the mel-spectrogram and then applying a vocoder to generate the enhanced waveform.…”
Section: Related Work
confidence: 99%
“…In [31], the authors applied adversarial training to a super-resolution network converting the mel spectrogram generated by the AR mel-synthesis network to a linear spectrogram. In addition, in [32], the authors found that the mel spectrogram generated by a TTS model trained with only a reconstruction loss was over-smooth, and proposed an enhancer model for the mel spectrogram that reduces the gap between the true mel spectrogram and the generated one.…”
Section: Introduction
confidence: 99%