Interspeech 2020
DOI: 10.21437/interspeech.2020-1238

VocGAN: A High-Fidelity Real-Time Vocoder with a Hierarchically-Nested Adversarial Network

Abstract: We present a novel high-fidelity real-time neural vocoder called VocGAN. A recently developed GAN-based vocoder, MelGAN, produces speech waveforms in real-time. However, it often produces a waveform that is insufficient in quality or inconsistent with acoustic characteristics of the input mel spectrogram. VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and consistency of the output waveform. VocGAN applies a multi-scale waveform generator and a hierarchically-nested discriminator …

Cited by 54 publications (44 citation statements)
References 12 publications
“…GAN-TTS also adopts an ensemble of 10 similar discriminators with different input window sizes with or without the conditional acoustic features to guide its generator to learn different aspects of speech information. Furthermore, the variants of MelGAN such as VocGAN [28] adopted a multi-scale generator and a hierarchically-nested discriminator and multiband MelGAN [29] incorporated a multi-band technique into MelGAN also achieved further speech quality or generative efficiency improvements.…”
Section: Natural Speech | Citation type: mentioning
confidence: 99%
“…Similar ideas are used in Multiband-MelGAN [15], which generates each sub-band of the target speech separately, saving computational power, and then obtains the final waveform using a synthesis PQMF. Research in this field is very active and we can cite the very recent GAN vocoders such as VocGan [16] and HiFi-GAN [17].…”
Section: Related Work | Citation type: mentioning
confidence: 99%
“…HiFi-GAN [19] consists of small sub-discriminators obtaining specific periodic parts of raw waveforms, achieving higher computational efficiency and sample quality. VocGAN [41] applies the joint conditional and unconditional objective, which is inspired by successful results in high-resolution image synthesis. Although these vocoders could be applied in SVS systems, distinct degradations occur when generalizing those systems to unseen singers.…”
Section: Vocoder | Citation type: mentioning
confidence: 99%