ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053571

Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning

Abstract: In this paper we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE) to learn from primarily unpaired audio data and produce sequences of representations very close to the phoneme sequences of speech utterances. This is achieved by proper temporal segmentation to make the representations phoneme-synchronized, and proper phonetic clustering to keep the total number of distinct representations close to the number of phonemes. Mapping between the distinct representations and phonemes is learned from a …
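The two ingredients the abstract names — quantizing each frame to one of roughly phoneme-many codes, and segmenting in time so repeated codes collapse into phoneme-like units — can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, shapes, and random toy data are illustrative assumptions.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each frame-level latent vector to its nearest codebook entry.

    latents:  (T, D) array of encoder outputs, one row per frame.
    codebook: (K, D) array of code vectors; choosing K close to the
              number of phonemes lets each code align with one phone.
    Returns the quantized sequence and the chosen code indices.
    """
    # Squared Euclidean distance between every frame and every code.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)  # nearest code per frame
    return codebook[idx], idx

def collapse_repeats(idx):
    """Merge consecutive identical codes into one segment, approximating
    the temporal segmentation that yields phoneme-synchronized codes."""
    keep = np.concatenate(([True], idx[1:] != idx[:-1]))
    return idx[keep]

# Toy usage with random data (shapes only; not trained representations).
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 4))   # 6 frames, 4-dim latents
cb = rng.normal(size=(3, 4))  # 3 codes
q, idx = quantize(z, cb)
seq = collapse_repeats(idx)
```

In the actual model the codebook is learned jointly with the encoder; here it is fixed random vectors purely to show the nearest-neighbor assignment and the run-length collapse.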

Cited by 36 publications (19 citation statements)
References 23 publications

“…There are various methods for learning quantized latent features and in this study, we focus on two popular approaches: quantized latent features can be learnt through autoencoding, which reconstructs the original signal, either the raw waveform or spectrogram features [4,6]. Another approach learns latent features through predicting representations of future time-steps [7,8,9,10].…”
Section: Introduction (mentioning)
confidence: 99%
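The second approach this statement mentions — learning latents by predicting representations of future time-steps — can be sketched as a contrastive objective: score a linear prediction of the frame k steps ahead against all frames as candidates. This is a hedged illustration, not the objective from [7,8,9,10]; the dot-product scoring and the function name are assumptions.

```python
import numpy as np

def future_prediction_loss(latents, k, W):
    """Negative log-probability that a linear map W picks out the true
    latent k steps ahead, with all frames serving as candidates
    (an InfoNCE-style sketch with dot-product scores).

    latents: (T, D) array of frame-level representations.
    k:       prediction horizon in frames.
    W:       (D, D) prediction matrix.
    """
    T, _ = latents.shape
    preds = latents[:T - k] @ W      # predicted future latents
    scores = preds @ latents.T       # each prediction vs. all frames
    # Log-softmax over candidate frames for each prediction step.
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    correct = np.arange(k, T)        # index of the true future frame
    return -log_probs[np.arange(T - k), correct].mean()

# Toy usage: random latents, identity predictor, one-step horizon.
rng = np.random.default_rng(1)
z = rng.normal(size=(5, 3))
loss = future_prediction_loss(z, 1, np.eye(3))
```

Minimizing such a loss pushes the encoder toward features whose near future is linearly predictable, which is the property these predictive approaches exploit.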
“…Furthermore, the investigation in [22] of these pre-trained vector representations shows that the pre-trained quantized vector representations of vowel phonemes have locations in latent space similar to those shown in the IPA vowel chart defined by human experts. This means that these latent representations are capable of representing speech data phonetically, and therefore can be used in end-to-end ASR systems as well as other non-ASR applications such as pronunciation and fluency assessment.…”
Section: Related Work (mentioning)
confidence: 89%
“…In the future, we will try to improve the TSA so that it can be used on the current utterance, and try to use unsupervised/semi-supervised methods [28,29] to better extract the word embedding.…”
Section: Discussion (mentioning)
confidence: 99%