“…Deep generative models of multiple types (Kingma & Welling, 2013; Goodfellow et al., 2014; van den Oord et al., 2016b; Dinh et al., 2016) have seen incredible progress in the last few years on multiple modalities, including natural images (van den Oord et al., 2016c; Zhang et al., 2019; Brock et al., 2018; Kingma & Dhariwal, 2018; Ho et al., 2019a; Karras et al., 2017; van den Oord et al., 2017; Razavi et al., 2019; Vahdat & Kautz, 2020; Ho et al., 2020; Chen et al., 2020; Ramesh et al., 2021), audio waveforms conditioned on language features (van den Oord et al., 2016a; Oord et al., 2017; Prenger et al., 2019; Bińkowski et al., 2019), natural language in the form of text (Radford et al., 2019; Brown et al., 2020), and music (Dhariwal et al., 2020). These results have been made possible by fundamental advances in deep learning architectures (He et al., 2015; van den Oord et al., 2016b; Vaswani et al., 2017; Zhang et al., 2019; Menick & Kalchbrenner, 2018), as well as by the availability of compute resources (Jouppi et al., 2017; Amodei & Hernandez, 2018) that are more powerful and plentiful than a few years ago.…”