Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder

Gârbacea, Cristina; Oord, Aäron van den; Li, Yazhe; Lim, Felicia S. C.; Luebs, Alejandro; Vinyals, Oriol; Walters, Thomas C.

doi:10.1109/icassp.2019.8683277

Cited by 91 publications

(66 citation statements)

References 12 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Hence, to capture any temporal correlation, we require timedomain coupling in the autoencoding process, which can be achieved by either (1) applying a feed-forward network that can access data from multiple time-steps, e.g. using temporal convolution [8] or selfattention [12], or (2) using recurrent network architectures. In this work, we focus on the latter approach.…”

Section: Recurrent Autoencodermentioning

confidence: 99%

“…It is worth noting that the video compression framework of DVC [13] can be viewed as an instantiation of Fig. 1(f) where decoded data in the previous time steps are fed back to the encoder for explicit motion and residual information compression, and the one proposed in VQ-VAE [7,8] can be viewed as a convolutional variant of Fig. 1(d) where both encoder and decoder use convolution to cover a large temporal receptive field without any decoder-to-encoder feedback.…”

Section: Recurrent Autoencodermentioning

confidence: 99%

See 1 more Smart Citation

Feedback Recurrent Autoencoder

Yang

Sautiere²,

Ryu

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

In this work, we propose a new recurrent autoencoder architecture, termed Feedback Recurrent AutoEncoder (FRAE), for online compression of sequential data with temporal dependency. The recurrent structure of FRAE is designed to efficiently extract the redundancy along the time dimension and allows a compact discrete representation of the data to be learned. We demonstrate its effectiveness in speech spectrogram compression. Specifically, we show that the FRAE, paired with a powerful neural vocoder, can produce high-quality speech waveforms at a low, fixed bitrate. We further show that by adding a learned prior for the latent space and using an entropy coder, we can achieve an even lower variable bitrate.

show abstract

Section: Recurrent Autoencodermentioning

confidence: 99%

Section: Recurrent Autoencodermentioning

confidence: 99%

Feedback Recurrent Autoencoder

Yang

Sautiere²,

Ryu

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

show abstract

“…While generative autoregressive models, such as WaveNet, have greatly improved the synthesized speech quality [12], it comes at the cost of model complexity during the decoding process [13]. For example, vector quantized variational autoencoders (VQ-VAE) with WaveNet decoder achieves impressive speech quality at a very low bitrate of 1.6 kbps, yet with approximately 20 million trainable parameters [14]. To make such a system more efficient, LPC can still unload computational overheads from neural networks.…”

Section: Introductionmentioning

confidence: 99%

Efficient and Scalable Neural Residual Waveform Coding with Collaborative Quantization

Zhen

Lee

Sung

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

Scalability and efficiency are desired in neural speech codecs, which supports a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC to a neural network, but bridges the computational capacity of advanced neural network models and traditional, yet efficient and domain-specific digital signal processing methods in an integrated manner. We demonstrate that CQ achieves much higher quality than its predecessor at 9 kbps with even lower model complexity. We also show that CQ can scale up to 24 kbps where it outperforms AMR-WB and Opus. As a neural waveform codec, CQ models are with less than 1 million parameters, significantly less than many other generative models.

show abstract

“…Many DNN methods [11][12] take inputs in time-frequency (T-F) domain from short time Fourier transform (STFT) or modified discrete cosine transform (MDCT), etc. Recent DNN-based codecs [13][14] [15] [16] model speech signals in time domain directly without T-F transformation. They are referred to as endto-end methods, yielding competitive performance comparing with current speech coding standards, such as AMR-WB [7].…”

Section: Introductionmentioning

confidence: 99%

“…Many DNN-based codecs achieve both low bitrates and high perceptual quality, two main targets for speech codecs [17][18] [19], but with a high model complexity. A WaveNet based variational autoencoder (VAE) [16] outperforms other low bitrate codecs in the listening test, however, with 20 millions parameters, a too big model for real-time processing in a resource-constrained device. Similarly, codecs built on Sam-pleRNN [20] [21] can also be energy-intensive.…”

Section: Introductionmentioning

confidence: 99%

Cascaded Cross-Module Residual Learning Towards Lightweight End-to-End Speech Coding

Zhen¹,

Sung²,

Lee³

et al. 2019

Interspeech 2019

View full text Add to dashboard Cite

Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier with each module reconstructing the residual from its preceding modules. CMRL differs from other DNN-based speech codecs, in that rather than modeling speech compression problem in a single large neural network, it optimizes a series of less-complicated modules in a two-phase training scheme. The proposed method shows better objective performance than AMR-WB and the state-of-the-art DNNbased speech codec with a similar network architecture. As an end-to-end model, it takes raw PCM signals as an input, but is also compatible with linear predictive coding (LPC), showing better subjective quality at high bitrates than AMR-WB and OPUS. The gain is achieved by using only 0.9 million trainable parameters, a significantly less complex architecture than the other DNN-based codecs in the literature. Index Terms: speech coding, deep neural network, entropy coding, residual learning Model descriptionBefore introducing CMRL as a module carrier, we describe the component module to be hosted by CMRL. The component moduleRecently, an end-to-end DNN speech codec (referred to as Kankanahalli-Net) has shown competitive performance comparable to one of the standards (AMR-WB) [14]. We describe our component model derived from Kankanahalli-Net that consists of bottleneck residual learning [24], soft-to-hard quantization [25], and sub-pixel convolutional neural networks for upsampling [26]. Figure 1 depicts the component module.

show abstract

Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder

Cited by 91 publications

References 12 publications

Feedback Recurrent Autoencoder

Feedback Recurrent Autoencoder

Efficient and Scalable Neural Residual Waveform Coding with Collaborative Quantization

Cascaded Cross-Module Residual Learning Towards Lightweight End-to-End Speech Coding

Contact Info

Product

Resources

About