Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning

Liu, Alexander H.; Tu, Tao; Lee, Hung-yi; Lee, Lin-Shan

doi:10.1109/icassp40776.2020.9053571

Cited by 36 publications

(19 citation statements)

References 23 publications

“…There are various methods for learning quantized latent features and in this study, we focus on two popular approaches: quantized latent features can be learnt through autoencoding, which reconstructs the original signal, either the raw waveform or spectrogram features [4,6]. Another approach learns latent features through predicting representations of future time-steps [7,8,9,10].…”

Section: Introduction (mentioning)

confidence: 99%

A Comparison of Discrete Latent Variable Models for Speech Representation Learning

Zhou, Baevski, Auli 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Neural latent variable models enable the discovery of interesting structure in speech audio data. This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal. Our study compares the representations learned by vq-vae and vq-wav2vec in terms of sub-word unit discovery and phoneme recognition performance. Results show that future time-step prediction with vq-wav2vec achieves better performance. The best system achieves an error rate of 13.22 on the ZeroSpeech 2019 ABX phoneme discrimination challenge.

“…Furthermore, the investigation of these pre-trained vector representations, in [22], shows that the pre-trained quantized vector representations of vowel phonemes have locations in latent space similar to those shown in the IPA vowel chart defined by human experts. This means that these latent representations are capable of representing speech data phonetically, and therefore can be used in end-to-end ASR systems as well as other non-ASR applications such as pronunciation and fluency assessment.…”

Section: Related Work (mentioning)

confidence: 89%

Self-Supervised End-to-End ASR for Low Resource L2 Swedish

et al. 2021

Unlike traditional (hybrid) Automatic Speech Recognition (ASR), end-to-end ASR systems simplify the training procedure by directly mapping acoustic features to sequences of graphemes or characters, thereby eliminating the need for specialized acoustic, language, or pronunciation models. However, one drawback of end-to-end ASR systems is that they require more training data than conventional ASR systems to achieve similar word error rate (WER). This makes it difficult to develop ASR systems for tasks where transcribed target data is limited such as developing ASR for Second Language (L2) speakers of Swedish. Nonetheless, recent advancements in self-supervised acoustic learning, manifested in wav2vec models [1, 2, 3], leverage the available untranscribed speech data to provide compact acoustic representation that can achieve low WER when incorporated in end-to-end systems. To this end, we experiment with several monolingual and cross-lingual self-supervised acoustic models to develop end-to-end ASR system for L2 Swedish. Even though our test is very small, it indicates that these systems are competitive in performance with traditional ASR pipeline. Our best model seems to reduce the WER by 7% relative to our traditional ASR baseline trained on the same target data.

“…In the future, we will try to improve the TSA so that it can be used in current utterance, and try to use some unsupervised/semisupervised methods [28,29] to extract the word embedding better.…”

Section: Discussion (mentioning)

confidence: 99%

History Utterance Embedding Transformer LM for Speech Recognition

Deng, Cheng, Miao, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

History utterances contain rich contextual information; however, better extracting information from the history utterances and using it to improve the language model (LM) is still challenging. In this paper, we propose the history utterance embedding Transformer LM (HTLM), which includes an embedding generation network for extracting contextual information contained in the history utterances and a main Transformer LM for current prediction. In addition, the two-stage attention (TSA) is proposed to encode richer contextual information into the embedding of history utterances (h-emb) while supporting GPU parallel training. Furthermore, we combine the extracted h-emb and embedding of current utterance (c-emb) through the dot-product attention and a fusion method for HTLM's current prediction. Experiments are conducted on the HKUST dataset and achieve a 23.4% character error rate (CER) on the test set. Compared with the baseline, the proposed method yields 12.86 absolute perplexity reduction and 0.8% absolute CER reduction.
