A Comparison of Discrete Latent Variable Models for Speech Representation Learning

Zhou, Henry; Baevski, Alexei; Auli, Michael

doi:10.1109/icassp39728.2021.9413680

Cited by 8 publications

(10 citation statements)

References 11 publications


“…A similar insight was obtained in [258], which compared vq-vae and vq-wav2vec with respect to their ability to discover phonetic units. The vq-vae model extracts continuous features from the audio signal; a quantizer then maps them into a discrete space, and a decoder is trained to reconstruct the original audio conditioned on the latent discrete representation and the past acoustic observations.…”

Section: B Training Criterion (supporting)

confidence: 60%

Self-Supervised Speech Representation Learning: A Review

Mohamed, Lee, Borgholt et al. 2022

Preprint
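The quantizer step described in the citation statement above (continuous features mapped into a discrete space) can be illustrated with a nearest-codeword lookup. This is a minimal sketch, not the implementation from any of the cited papers: the function name `quantize`, the toy codebook, and the 2-dimensional features are all assumptions made for illustration.

```python
def quantize(frame, codebook):
    """Return (index, codeword) of the codebook entry closest to `frame`.

    Closeness is squared Euclidean distance, an assumption for this sketch.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
    return index, codebook[index]


# Hypothetical codebook with three 2-dimensional codewords.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical continuous encoder outputs, one per frame.
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

# Each continuous frame is replaced by the index of its nearest codeword.
codes = [quantize(f, codebook)[0] for f in features]
print(codes)
```

The list `codes` is the discrete representation; a decoder would be conditioned on the corresponding codewords rather than on the continuous features.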


“…8 to achieve 3kbps. This is quite different from the diversity loss in Gumbel-Softmax based method [26], where a uniform distribution is enforced on the codeword usage.…”

Section: E Vector Quantization With Rate Control (mentioning)

confidence: 82%
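The statement above says the diversity loss enforces a uniform distribution on codeword usage. The quoted text does not give the formula, so the following is only one plausible sketch: it assumes an entropy-based penalty on the batch-averaged codeword probabilities, normalized so that uniform usage gives 0 and total collapse onto one codeword gives 1. The name `diversity_penalty` is hypothetical.

```python
import math


def diversity_penalty(prob_rows):
    """Penalty in [0, 1]: 0 for uniform average usage, 1 for full collapse.

    `prob_rows` holds one probability vector over codewords per frame.
    The normalized-entropy form is an assumption of this sketch.
    """
    num_codes = len(prob_rows[0])
    avg = [sum(row[v] for row in prob_rows) / len(prob_rows)
           for v in range(num_codes)]
    entropy = -sum(p * math.log(p) for p in avg if p > 0.0)
    return 1.0 - entropy / math.log(num_codes)


uniform_usage = [[0.25, 0.25, 0.25, 0.25]] * 3
collapsed_usage = [[1.0, 0.0, 0.0, 0.0]] * 3

print(diversity_penalty(uniform_usage))    # close to 0
print(diversity_penalty(collapsed_usage))  # equal to 1
```

Minimizing such a term pushes the model toward using all codewords equally often, which is the behavior the statement contrasts with rate-controlled quantization.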

“…Recently, it is also applied to discrete representation learning [12] and serves as the basis of end-to-end neural audio coding [6]- [11]. As quantization is inherently not differentiable, to enable end-to-end learning in neural audio coding, several ways have been proposed in the literature, including the one with commitment loss in VQ-VAE [12], EMA [12], Gumbel-Softmax based method [26] [27] and the soft-to-hard technique [28]. VQ-VAE [12] approximates the derivative by the identity function that directly copies gradients from the decoder input to the encoder output.…”

Section: Vector Quantization (mentioning)

confidence: 99%
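The identity-gradient approximation mentioned in the statement above (gradients copied from the decoder input to the encoder output) can be sketched without an autograd library by writing the forward and backward passes by hand. The toy squared-error decoder loss, the codebook, and all function names below are assumptions for illustration only.

```python
def quantize_forward(z, codebook):
    """Forward pass: replace encoder output `z` by its nearest codeword."""
    idx = min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])),
    )
    return idx, list(codebook[idx])


def quantize_backward_straight_through(grad_wrt_quantized):
    """Backward pass: treat the quantizer as the identity function, so the
    gradient at the decoder input is passed unchanged to the encoder output."""
    return list(grad_wrt_quantized)


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # hypothetical codebook
z = [0.9, 1.2]                                    # hypothetical encoder output
target = [0.5, 1.5]                               # hypothetical decoder target

_, q = quantize_forward(z, codebook)

# Toy decoder loss: sum((q - target)^2), so d(loss)/dq = 2 * (q - target).
grad_q = [2.0 * (qi - ti) for qi, ti in zip(q, target)]

# Straight-through: the encoder receives the same gradient as the decoder input.
grad_z = quantize_backward_straight_through(grad_q)
print(q, grad_z)
```

Because the nearest-codeword lookup itself has zero gradient almost everywhere, this copy step is what lets the encoder receive a training signal at all.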

“…As discussed in Section II.D, Gumbel-Softmax [26] [27] and soft-to-hard [28] methods introduce the probability of selecting a codeword and thus make rate control feasible. However, Gumbel-Softmax uses a linear projection to select the codeword without explicitly correlating it with the quantization error, as shown in Fig.…”

Section: E Vector Quantization With Rate Control (mentioning)

confidence: 99%
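The contrast drawn in the statement above is between selecting a codeword from linearly projected logits and selecting it with explicit reference to the quantization error. The sketch below shows both kinds of logits feeding the same Gumbel-Softmax relaxation. It is a hedged illustration: the projection weights, the use of negative squared distance as logits, and every function name are assumptions, not the mapping used in the cited work.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)           # avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                           # subtract max for stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(z, weights):
    """Logits from a linear projection of z; unrelated to codeword distance."""
    return [sum(w * x for w, x in zip(row, z)) for row in weights]


def distance_logits(z, codebook):
    """Logits as negative squared distance to each codeword (an assumption)."""
    return [-sum((x - c) ** 2 for x, c in zip(z, cw)) for cw in codebook]


rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]    # hypothetical
weights = [[0.3, -0.1], [0.2, 0.4], [-0.5, 0.1]]    # hypothetical projection
z = [0.9, 1.2]

probs_linear = gumbel_softmax(linear_logits(z, weights), 1.0, rng)
d_logits = distance_logits(z, codebook)
probs_distance = gumbel_softmax(d_logits, 1.0, rng)
print(probs_linear, probs_distance)
```

With distance-based logits, the highest logit always belongs to the nearest codeword, so the selection probability is tied to the quantization error; with a learned projection no such link is built in.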

“…3) Distance-Gumbel-Softmax-based VQ: We compare the Distance-Gumbel-Softmax-based VQ mechanism with the previous Gumbel-Softmax-based method in [26]. Table III shows that at 3kbps, our method outperforms the previous Gumbel-Softmax-based method in all metrics, indicating that the explicit injection of distance information helps improve recon- We also show the distribution of the learned codebooks to help understand how the Distance-Gumbel-Softmax-based vector quantization learns.…”

Section: Ablation Study (mentioning)

confidence: 99%


Predictive Neural Speech Coding

Xue, Peng, Xue et al. 2022

Preprint


Neural audio/speech coding has recently shown that it can deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs encode either acoustic features or blind features learned with a convolutional neural network, which leaves temporal redundancies in the encoded features. This paper introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes the TF-Codec for low-latency neural speech coding in an end-to-end way. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames, so that temporal correlations are further removed. In addition, a learnable compression is applied to the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions under a rate constraint. Subjective results on multilingual speech datasets show that, with a latency of 40ms, the proposed TF-Codec at 1kbps achieves much better quality than Opus at 9kbps, and TF-Codec at 3kbps outperforms both EVS at 9.6kbps and Opus at 12kbps. Numerous studies are conducted to show the effectiveness of these techniques.


Learning Dependencies of Discrete Speech Representations with Neural Hidden Markov Models

Yeh, Tang 2023

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
