Interspeech 2021
DOI: 10.21437/interspeech.2021-1016
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

Cited by 54 publications (27 citation statements). References 0 publications.
“…The LeakyReLU with α = 0.2 is set as the activation function for all layers except for the input layer. Most GAN-based vocoders are also trained with multi-scale discriminators by pre-processing the waveform into waveforms with different sample rates [16] or spectrograms with different STFT parameters [22]. However, it is difficult to do so in the training of the acoustic model, since the Mel spectrogram is hard to be down-sampled well or converted to other features with different scales.…”
Section: Model Architecture
confidence: 99%
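The multi-resolution spectrogram discriminators referenced above operate on magnitude spectrograms computed with several different STFT parameter sets. A minimal NumPy sketch of that front-end follows; the three (FFT size, hop size) pairs are illustrative assumptions, not necessarily the values used in the cited papers:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via a Hann-windowed sliding DFT (NumPy only)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

# One STFT resolution per sub-discriminator (illustrative values).
resolutions = [(512, 128), (1024, 256), (2048, 512)]

# 1 second of a 220 Hz tone at a 16 kHz sample rate as a stand-in waveform.
x = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)

# Each discriminator would see the waveform at a different time/frequency trade-off.
specs = [stft_mag(x, n, h) for n, h in resolutions]
for (n, h), s in zip(resolutions, specs):
    print(n, h, s.shape)
```

Each entry of `specs` has shape (frames, n_fft // 2 + 1), so the same waveform is presented at three time/frequency resolutions.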
“…To address the aforementioned problem, the generative adversarial network (GAN) shows great potential, which has been widely applied in speech synthesis, including statistical speech synthesis [12,13], spectrogram post-filter [14], spectrogram super-resolution [15], and neural vocoder [16,17,18]. In this framework, NAR-TTS can be enhanced by only using a discriminator.…”
Section: Introduction
confidence: 99%
“…Mel-spectrograms were obtained by applying an 80 band mel filter bank. We adopted Univnet-c16 [17] as a vocoder, which has beneficial lightweight properties by using a location-variable convolution (LVC) technique [18]. The dimension of all hidden embeddings was set to 256, and the receptive field of the vocoder, auxiliary predictor, and the conditional discriminator was set to 19.…”
Section: Experiments 4.1 Experimental Setup
confidence: 99%
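The 80-band mel filter bank mentioned in this excerpt can be sketched in NumPy. The sample rate, FFT size, and frequency range below are assumptions for illustration only; the excerpt does not specify them:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=24000, n_fft=1024, n_mels=80, fmin=0.0, fmax=12000.0):
    """Triangular mel filters mapping n_fft//2+1 FFT bins to n_mels bands.

    All parameter defaults are illustrative assumptions, not values from
    the cited paper.
    """
    # Band edges: n_mels + 2 points equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)

    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising slope
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)  # → (80, 513)
```

Multiplying a linear magnitude spectrogram of shape (frames, 513) by `fb.T` yields the 80-band mel spectrogram that the vocoder consumes as input.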
“…An acoustic feature generator can be autoregressive and attention-based for implicit speech-text alignments [1], [2] or it can be non-autoregressive for efficient parallel inference and duration informed for robustness on synthesis error [3], [4], [5]. There is lots of research on neural vocoder as well and some of the famous, widely used include [6], [7], normalizing flow based [8] and generative adversarial network (GAN) based [9], [10], [11], [12].…”
Section: Introduction
confidence: 99%
“…Note that a neural vocoder takes a ground-truth acoustic feature for training and a predicted one from an acoustic feature generator for inference. For optimal performance, we can further train a pre-trained neural vocoder with predicted acoustic features, which is called fine-tuning [12], [13]. Or we can train a neural vocoder with predicted acoustic feature from the beginning [1].…”
Section: Introduction
confidence: 99%