2017
DOI: 10.48550/arxiv.1711.10433
Preprint

Parallel WaveNet: Fast High-Fidelity Speech Synthesis

citations
Cited by 45 publications
(70 citation statements)
references
References 0 publications
“…The Vector Quantized Variational Autoencoder (VQ-VAE) (Van Den Oord et al, 2017) is a model that learns to compress high dimensional data points into a discretized latent space and reconstruct them. The encoder E(x) → h first encodes x into a series of latent vectors h which is then discretized by performing a nearest neighbors lookup in a codebook of embeddings C = {e_i}_{i=1}^{K} of size K. The decoder D(e) → x then learns to reconstruct x from the quantized encodings.…”
Section: VQ-VAE
Citation type: mentioning
confidence: 99%
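
The nearest-neighbor codebook lookup described in the statement above can be sketched as follows. This is a minimal illustration in plain Python, not the citing authors' implementation: the function names `squared_distance` and `quantize`, the use of squared Euclidean distance, and the toy codebook values are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(latents, codebook):
    """Map each latent vector h to its nearest codebook embedding e_k.

    Returns the chosen codebook indices and the quantized vectors.
    (Hypothetical helper; the distance metric is an assumption.)
    """
    indices = []
    quantized = []
    for h in latents:
        # Index of the codebook entry closest to h.
        k = min(range(len(codebook)),
                key=lambda i: squared_distance(h, codebook[i]))
        indices.append(k)
        quantized.append(codebook[k])
    return indices, quantized


# Toy codebook C = {e_1, e_2, e_3} with K = 3 two-dimensional embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Encoder outputs h (made-up values standing in for E(x)).
latents = [[0.9, 1.2], [-0.1, 0.2], [-0.8, 1.7]]

indices, quantized = quantize(latents, codebook)
print(indices)    # [1, 0, 2]
print(quantized)  # [[1.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
```

In this sketch the decoder would receive `quantized` rather than `latents`; training details such as how gradients pass through the lookup are outside what the quoted statement covers.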
“…Deep generative models of multiple types (Kingma & Welling, 2013; Goodfellow et al, 2014; van den Oord et al, 2016b; Dinh et al, 2016) have seen incredible progress in the last few years on multiple modalities including natural images (van den Oord et al, 2016c; Zhang et al, 2019; Brock et al, 2018; Kingma & Dhariwal, 2018; Ho et al, 2019a; Karras et al, 2017; Van Den Oord et al, 2017; Razavi et al, 2019; Vahdat & Kautz, 2020; Ho et al, 2020; Chen et al, 2020; Ramesh et al, 2021), audio waveforms conditioned on language features (van den Oord et al, 2016a; Oord et al, 2017; Prenger et al, 2019; Bińkowski et al, 2019), natural language in the form of text (Radford et al, 2019; Brown et al, 2020), and music generation (Dhariwal et al, 2020). These results have been made possible thanks to fundamental advances in deep learning architectures (He et al, 2015; van den Oord et al, 2016b; Vaswani et al, 2017; Zhang et al, 2019; Menick & Kalchbrenner, 2018) as well as the availability of compute resources (Jouppi et al, 2017; Amodei & Hernandez, 2018) that are more powerful and plentiful than a few years ago.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Therefore, later sound generation schemes adopted a WaveNet network for conditioning. The majority of these works conditioned their model to spectrograms [31] [52] [49] [53] [33] [54] [28] [55] while others included linguistic features and pitch information [12] [56], phoneme encodings [6], features extracted from the STRAIGHT vocoder [57] or even MIDI representations [58].…”
Section: A Additional Input
Citation type: mentioning
confidence: 99%
“…This architecture manages to increase the performance of autoregressive models since the sampling can be processed in parallel. Using Inverse Autoregressive Flows (IAF), Oord et al increased the efficiency of WaveNet [12]. Their implementation follows a "probability density distillation" where a pre-trained WaveNet model is used as a teacher and scores the samples a WaveNet student outputs.…”
Section: B Normalizing Flow
Citation type: mentioning
confidence: 99%
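
The teacher-scores-student idea in the statement above can be illustrated with a toy example. This is a sketch under loud assumptions, not the Parallel WaveNet implementation: both "models" are one-dimensional Gaussians instead of WaveNets, the student is a single affine transform of noise (the simplest possible stand-in for an inverse autoregressive flow), and the names `gaussian_log_pdf` and `distillation_loss` are hypothetical. The loss is a Monte Carlo estimate of KL(student || teacher): the student draws samples in parallel from noise, and the fixed teacher scores them.

```python
import math
import random


def gaussian_log_pdf(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) at x."""
    return (-0.5 * math.log(2.0 * math.pi)
            - math.log(std)
            - 0.5 * ((x - mean) / std) ** 2)


def distillation_loss(student_mean, student_std,
                      teacher_mean, teacher_std,
                      num_samples=10000, seed=0):
    """Monte Carlo estimate of KL(student || teacher).

    Toy stand-in: the student maps noise z ~ N(0, 1) to a sample
    x = mean + std * z; the fixed teacher then scores each x.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)                # noise input
        x = student_mean + student_std * z     # student sample
        log_p_student = gaussian_log_pdf(x, student_mean, student_std)
        log_p_teacher = gaussian_log_pdf(x, teacher_mean, teacher_std)
        total += log_p_student - log_p_teacher
    return total / num_samples


# A student that already matches the teacher incurs zero loss.
matched = distillation_loss(0.0, 1.0, 0.0, 1.0)
# A student shifted by 1.0 has analytic KL = 0.5; the estimate is close to it.
shifted = distillation_loss(1.0, 1.0, 0.0, 1.0)
print(matched, shifted)
```

Minimizing such a loss with respect to the student's parameters pulls the student's samples toward regions the teacher considers likely, which is the role the quoted statement assigns to the pre-trained teacher.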