ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413680

A Comparison of Discrete Latent Variable Models for Speech Representation Learning

Abstract: Neural latent variable models enable the discovery of interesting structure in speech audio data. This paper presents a comparison of two different approaches, broadly based on either predicting future time-steps or auto-encoding the input signal. Our study compares the representations learned by VQ-VAE and vq-wav2vec in terms of sub-word unit discovery and phoneme recognition performance. Results show that future time-step prediction with vq-wav2vec achieves better performance. The best system achieves an er…

Cited by 8 publications (10 citation statements). References 11 publications.
“…A similar insight was obtained in [258], which compared vq-vae and vq-wav2vec with respect to their ability to discover phonetic units. The vq-vae model extracts continuous features from the audio signal; a quantizer then maps them into a discrete space, and a decoder is trained to reconstruct the original audio conditioned on the latent discrete representation and the past acoustic observations.…”
Section: B. Training Criterion (supporting)
confidence: 60%
“…8 to achieve 3kbps. This is quite different from the diversity loss in Gumbel-Softmax based method [26], where a uniform distribution is enforced on the codeword usage.…”
Section: E. Vector Quantization With Rate Control (mentioning)
confidence: 82%
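The uniform-usage constraint mentioned in the statement above is typically enforced through a diversity loss over the Gumbel-Softmax assignment probabilities. The sketch below is illustrative only: the function name is invented, and the exact normalisation loosely follows the wav2vec 2.0-style formulation rather than the cited paper's actual code.

```python
import numpy as np

def diversity_loss(probs):
    """Diversity penalty for a Gumbel-Softmax quantizer (sketch).

    probs: (batch, K) soft codeword-assignment probabilities.
    The batch-averaged usage distribution p_bar is pushed toward
    uniform by maximizing its entropy; the loss (K - exp(H)) / K
    is 0 for perfectly uniform usage and approaches 1 as the
    quantizer collapses onto a single codeword.
    """
    p_bar = probs.mean(axis=0)                       # average codeword usage
    entropy = -np.sum(p_bar * np.log(p_bar + 1e-9))  # H(p_bar), with epsilon for log(0)
    K = probs.shape[1]
    return (K - np.exp(entropy)) / K
```

With uniform assignments the loss is near zero, while a collapsed quantizer (all mass on one codeword) yields (K - 1)/K, so minimizing it spreads usage across the codebook.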
“…Recently, it has also been applied to discrete representation learning [12] and serves as the basis of end-to-end neural audio coding [6]–[11]. As quantization is inherently not differentiable, several ways to enable end-to-end learning in neural audio coding have been proposed in the literature, including the commitment loss in VQ-VAE [12], EMA [12], Gumbel-Softmax based methods [26], [27], and the soft-to-hard technique [28]. VQ-VAE [12] approximates the derivative by the identity function, directly copying gradients from the decoder input to the encoder output.…”
Section: Vector Quantization (mentioning)
confidence: 99%
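The identity-function gradient approximation described above is the straight-through estimator: the forward pass emits the nearest codeword, while the backward pass treats quantization as the identity. A minimal NumPy sketch (function names are illustrative; `stop_gradient` stands in for the detach/stop-gradient node a real autodiff framework would provide):

```python
import numpy as np

def stop_gradient(x):
    # Identity in the forward pass; an autodiff framework would
    # block gradients through this node (e.g. a detach operation).
    return x

def vq_straight_through(z_e, codebook):
    """Nearest-codeword quantization with the straight-through trick.

    z_e: (D,) encoder output; codebook: (K, D) codewords.
    Writing the output as z_e + stop_gradient(z_q - z_e) makes the
    forward value equal z_q while the backward pass copies the
    decoder's gradient straight onto z_e, as in VQ-VAE.
    """
    idx = int(np.argmin(((codebook - z_e) ** 2).sum(axis=1)))
    z_q = codebook[idx]
    out = z_e + stop_gradient(z_q - z_e)
    return out, idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
out, idx = vq_straight_through(np.array([0.9, 1.2]), codebook)
```

In a framework like PyTorch the same trick is written `z_e + (z_q - z_e).detach()`, which is what lets the encoder train despite the non-differentiable argmin.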