2018
DOI: 10.48550/arxiv.1811.06292
Preprint

Towards achieving robust universal neural vocoding

Cited by 9 publications (13 citation statements)
References 0 publications

“…Note that there is no overlap between the training data of Voxceleb1 for neural vocoder and the evaluation data of speaker verification. To further show that the vocoder adopted in the proposed method is dataset independent, we also trained a universal vocoder [41,42] with the same structure as Vocoder, but on Lrg dataset from [42], which is a large speech dataset containing 6 languages and more than 600 speakers. The vocoder trained on Lrg is denoted as "Vocoder-L".…”
Section: Griffin-Lim and Parallel WaveGAN
mentioning, confidence: 99%
“…The model, which is described in [5], contains two Gated Recurrent Units (GRU) and two dense layers. We first concatenate the quantized speaker embedding to the quantized encoding and pass the resulting tensor through the first GRU.…”
Section: Autoregressive Decoder
mentioning, confidence: 99%
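The decoder structure described in that excerpt (two GRUs and two dense layers, with the quantized speaker embedding concatenated to the quantized encoding before the first GRU) can be sketched as follows. This is a minimal PyTorch illustration, not the cited authors' code; all layer sizes and the output dimension are placeholder assumptions.

import torch
import torch.nn as nn

class AutoregressiveDecoder(nn.Module):
    def __init__(self, enc_dim=64, spk_dim=64, hidden_dim=256, out_dim=256):
        super().__init__()
        # Two GRUs followed by two dense layers, as described in the excerpt.
        self.gru1 = nn.GRU(enc_dim + spk_dim, hidden_dim, batch_first=True)
        self.gru2 = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.fc1 = nn.Linear(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)  # e.g. logits over output bins (assumption)

    def forward(self, encoding, speaker_embedding):
        # encoding: (batch, time, enc_dim); speaker_embedding: (batch, spk_dim)
        spk = speaker_embedding.unsqueeze(1).expand(-1, encoding.size(1), -1)
        x = torch.cat([encoding, spk], dim=-1)  # concatenate speaker embedding to encoding
        x, _ = self.gru1(x)
        x, _ = self.gru2(x)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# Usage with dummy tensors:
dec = AutoregressiveDecoder()
logits = dec(torch.randn(2, 100, 64), torch.randn(2, 64))  # shape (2, 100, 256)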
“…Recently, there have been successful efforts in building learned codecs; starting with replacing the decoders with learned decoders for fixed encoders, which can operate as low as 2.4 kb/s to 1.6 kb/s [2,3]. These learned decoders leverage advances in speech synthesizing generative models such as WaveNet, WaveRNN, and LPCNet [4,5,6]. More recently, in [7,8], the encoder and decoder were both learned in a joint fashion, by using quantized bottlenecks based on Vector-Quantized Variational Auto-Encoders (VQ-VAE) [9], and soft-to-hard quantizers.…”
Section: Introduction
mentioning, confidence: 99%
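To illustrate the quantized bottleneck mentioned in that excerpt, below is a minimal sketch of VQ-VAE-style nearest-neighbour quantization with a straight-through gradient. The codebook size and dimensions are illustrative assumptions, not values from the cited papers.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):
        # z: (batch, time, dim) continuous encoder outputs
        flat = z.reshape(-1, z.size(-1))                # (B*T, dim)
        dist = torch.cdist(flat, self.codebook.weight)  # distances to all codes
        idx = dist.argmin(dim=-1)                       # nearest code per vector
        quantized = self.codebook(idx).view_as(z)
        # Straight-through estimator: gradients pass to z as if quantization were identity.
        return z + (quantized - z).detach(), idx.view(z.shape[:-1])

vq = VectorQuantizer()
zq, codes = vq(torch.randn(2, 100, 64))  # zq: (2, 100, 64), codes: (2, 100)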
“…First, a seq2seq acoustic model predicts mel-spectrograms from a sequence of phoneme-level linguistic inputs. Then a speaker-independent neural vocoder converts the mel-spectrograms into a high-fidelity audio waveform [12].…”
Section: Baseline Model
mentioning, confidence: 99%
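The two-stage pipeline in that excerpt (an acoustic model predicting mel-spectrograms, followed by a speaker-independent neural vocoder producing the waveform) can be sketched schematically as below. Both modules are hypothetical stand-ins, not the cited seq2seq model or vocoder; the phoneme inventory, mel dimension, and hop size are assumptions.

import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Stand-in for a seq2seq acoustic model: phoneme IDs -> mel-spectrogram frames."""
    def __init__(self, num_phonemes=100, n_mels=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, phoneme_ids):              # (batch, phones)
        x, _ = self.rnn(self.embed(phoneme_ids))
        return self.proj(x)                      # (batch, frames, n_mels)

class ToyVocoder(nn.Module):
    """Stand-in for a speaker-independent neural vocoder: mel frames -> waveform samples."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.net = nn.Linear(n_mels, hop)        # real vocoders are autoregressive or GAN-based

    def forward(self, mels):                     # (batch, frames, n_mels)
        return self.net(mels).flatten(1)         # (batch, frames * hop) samples

acoustic, vocoder = ToyAcousticModel(), ToyVocoder()
mels = acoustic(torch.randint(0, 100, (1, 20)))  # 20 phonemes -> 20 mel frames
waveform = vocoder(mels)                         # (1, 5120) waveform samples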