Deep learning methods for unsupervised acoustic modeling — Leap submission to ZeroSpeech challenge 2017

Ansari, T K; Kumar, Rajath; Singh, Sonali; Ganapathy, Sriram

doi:10.1109/asru.2017.8269013

Cited by 13 publications

(13 citation statements)

References 8 publications


“…[59] first clustered speech frames, then trained a neural network to predict the cluster IDs and used its hidden representation as features. [60] extended this scheme with features discovered by an autoencoder trained on MFCCs.…”

Section: Related Work

mentioning

confidence: 99%

Unsupervised Speech Representation Learning Using WaveNet Autoencoders

Chorowski

Weiss

Bengio

et al. 2019

IEEE/ACM Trans. Audio Speech Lang. Process.



We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high capacity WaveNet decoder to infer information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a regularization scheme that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.

Index Terms: autoencoder, speech representation learning, unsupervised learning, acoustic unit discovery
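As a rough illustration of the discrete bottleneck named in this abstract, the sketch below shows only the nearest-neighbour codebook lookup that a vector-quantized bottleneck performs at inference time. It is not the authors' implementation: the function name `quantize`, the toy codebook, and the two-dimensional latents are assumptions made for the example, and the training side of a VQ-VAE (encoder, decoder, codebook updates) is omitted entirely.

```python
def quantize(latent, codebook):
    """Return (index, entry) of the codebook vector closest to `latent`.

    Hypothetical helper for illustration: squared Euclidean distance,
    ties resolved in favour of the lowest index.
    """
    distances = [
        sum((z - e) ** 2 for z, e in zip(latent, entry))
        for entry in codebook
    ]
    index = min(range(len(codebook)), key=distances.__getitem__)
    return index, codebook[index]


# Toy codebook with three 2-dimensional entries (made-up values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Pretend encoder outputs for three consecutive frames (made-up values).
encoder_outputs = [[0.9, 1.2], [0.1, -0.2], [-0.8, 1.7]]

# Each frame is replaced by the index of its nearest codebook entry,
# giving a discrete token sequence for the utterance.
tokens = [quantize(z, codebook)[0] for z in encoder_outputs]
```

The resulting `tokens` list is the kind of discrete encoding whose mapping to phonemes the abstract says is measured.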



“…A DNN model is typically trained using available speech data. The learned features are obtained either from a designated low-dimension hidden layer of the DNN, known as the bottleneck features (BNFs) [12], or from the softmax output layer, known as the posterior features or posteriorgram [13]. To facilitate supervised training of the DNN, target labels of training speech are needed.…”

mentioning

confidence: 99%

“…One of the possible approaches is based on unsupervised clustering of training speech. The frame-level cluster indices can be used as target labels for DNN training [11]-[13]. Another approach seeks to use pre-trained out-of-domain ASR systems to tokenize untranscribed in-domain speech and hence each frame is assigned with an ASR senone label [5], [14].…”

mentioning

confidence: 99%

Exploiting Cross-Lingual Speaker and Phonetic Diversity for Unsupervised Subword Modeling

Feng

Lee

2019

IEEE/ACM Trans. Audio Speech Lang. Process.


This research addresses the problem of acoustic modeling of low-resource languages for which transcribed training data is absent. The goal is to learn robust frame-level feature representations that can be used to identify and distinguish subword-level speech units. The proposed feature representations comprise various types of multilingual bottleneck features (BNFs) that are obtained via multi-task learning of deep neural networks (MTL-DNN). One of the key problems is how to acquire high-quality frame labels for untranscribed training data to facilitate supervised DNN training. It is shown that learning of robust BNF representations can be achieved by effectively leveraging transcribed speech data and well-trained automatic speech recognition (ASR) systems from one or more out-of-domain (resource-rich) languages. Out-of-domain ASR systems can be applied to perform speaker adaptation with untranscribed training data of the target language, and to decode the training speech into frame-level labels for DNN training. It is also found that better frame labels can be generated by considering temporal dependency in speech when performing frame clustering. The proposed methods of feature learning are evaluated on the standard task of unsupervised subword modeling in Track 1 of the ZeroSpeech 2017 Challenge. The best performance achieved by our system is 9.7% in terms of across-speaker triphone minimal-pair ABX error rate, which is comparable to the best systems reported recently. Lastly, our investigation reveals that the closeness between target languages and out-of-domain languages and the amount of available training data for individual target languages could have significant impact on the goodness of learned features.
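The clustering route described in the citation statements above (frame-level cluster indices used as DNN target labels) can be sketched in a few lines. This is a minimal illustration using plain k-means on toy two-dimensional "frames"; the cited work's actual clustering method, features, and number of clusters are not given here, so `kmeans_labels`, the toy data, and `k=2` are all assumptions for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(frames, k, iterations=20, seed=0):
    """Cluster frames with plain k-means (hypothetical helper).

    Returns (labels, centroids); labels[i] is the cluster index of frames[i].
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct randomly chosen frames.
    centroids = [list(f) for f in rng.sample(frames, k)]
    labels = [0] * len(frames)
    for _ in range(iterations):
        # Assignment step: each frame goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(f, centroids[c]))
            for f in frames
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy stand-ins for acoustic feature frames (made-up values).
frames = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]
labels, centroids = kmeans_labels(frames, k=2)

# (frame, label) pairs of this form are what a supervised DNN would
# then be trained on in place of transcribed targets.
training_pairs = list(zip(frames, labels))
```

The cluster indices carry no phonetic meaning by themselves; they only give the network a frame-level classification target, after which a hidden or output layer is read off as the learned feature.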


“…Thus, the F -scores presented are not directly related to the precision and recall scores. Ansari et al [25] combine two sets of features, also trained on all five languages. The first set of features is always a high-dimensional hidden layer from an autoencoder trained on the MFCC frames.…”

Section: Track

mentioning

confidence: 99%

The zero resource speech challenge 2017

Dunbar

Cao

Benjumea

et al. 2017

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)



We describe a new challenge aimed at discovering subword and word units from raw speech. This challenge is the followup to the Zero Resource Speech Challenge 2015. It aims at constructing systems that generalize across languages and adapt to new speakers. The design features and evaluation metrics of the challenge are presented and the results of seventeen models are discussed.

Index Terms: zero resource speech technology, subword modeling, acoustic unit discovery, unsupervised term discovery
