An auto-encoder based approach to unsupervised learning of subword units

Badino, Leonardo; Canevari, Claudia; Fadiga, Luciano; Metta, Giorgio

doi:10.1109/icassp.2014.6855085

Cited by 66 publications

(54 citation statements)

References 8 publications

“…The stacked network is trained one layer at a time, each layer minimizing the loss of its output with respect to its input. A number of studies have shown that hidden representations from an intermediate layer in such a stacked AE are useful as features in speech applications [31, 33–38].…”

Section: Autoencoder Features (mentioning)

confidence: 99%

Feature Exploration for Almost Zero-Resource ASR-Free Keyword Spotting Using a Multilingual Bottleneck Extractor and Correspondence Autoencoders

et al. 2019

We compare features for dynamic time warping (DTW) when used to bootstrap keyword spotting (KWS) in an almost zero-resource setting. Such quickly-deployable systems aim to support United Nations (UN) humanitarian relief efforts in parts of Africa with severely under-resourced languages. Our objective is to identify acoustic features that provide acceptable KWS performance in such environments. As supervised resource, we restrict ourselves to a small, easily acquired and independently compiled set of isolated keywords. For feature extraction, a multilingual bottleneck feature (BNF) extractor, trained on well-resourced out-of-domain languages, is integrated with a correspondence autoencoder (CAE) trained on extremely sparse in-domain data. On their own, BNFs and CAE features are shown to achieve a more than 2% absolute performance improvement over baseline MFCCs. However, by using BNFs as input to the CAE, even better performance is achieved, with a more than 11% absolute improvement in ROC AUC over MFCCs and more than twice as many top-10 retrievals for two evaluated languages, English and Luganda. We conclude that integrating BNFs with the CAE allows both large out-of-domain and sparse in-domain resources to be exploited for improved ASR-free keyword spotting.
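The layer-wise training described in the citation statement above can be sketched as follows. This is only an illustration under stated assumptions: the tanh encoder, linear decoder, squared-error loss, plain SGD, layer sizes, and the random toy "frames" are all choices made here, not details taken from the cited papers. The helper names `train_autoencoder_layer` and `encode` are hypothetical.

```python
import math
import random


def train_autoencoder_layer(data, hidden_dim, epochs=200, lr=0.05, seed=0):
    """Train one layer (tanh encoder, linear decoder) to reconstruct its own input."""
    rng = random.Random(seed)
    in_dim = len(data[0])
    # Small random weights: encoder is hidden_dim x in_dim, decoder is in_dim x hidden_dim.
    w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(in_dim)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_enc]
            y = [sum(w * hj for w, hj in zip(row, h)) for row in w_dec]
            err = [yi - xi for yi, xi in zip(y, x)]
            total += sum(e * e for e in err)
            # Backpropagate the squared error (the factor 2 is folded into lr).
            grad_h = [
                sum(err[i] * w_dec[i][j] for i in range(in_dim)) * (1.0 - h[j] ** 2)
                for j in range(hidden_dim)
            ]
            for i in range(in_dim):
                for j in range(hidden_dim):
                    w_dec[i][j] -= lr * err[i] * h[j]
            for j in range(hidden_dim):
                for k in range(in_dim):
                    w_enc[j][k] -= lr * grad_h[j] * x[k]
        losses.append(total / len(data))
    return w_enc, losses


def encode(w_enc, data):
    """Map each input vector to its hidden representation."""
    return [[math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_enc] for x in data]


# Toy stand-in for acoustic feature frames (assumption: 20 frames, 4 dimensions).
rng = random.Random(1)
frames = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(20)]

# Greedy stacking: layer 2 is trained on the codes produced by layer 1.
layer1, losses1 = train_autoencoder_layer(frames, hidden_dim=3)
codes1 = encode(layer1, frames)
layer2, losses2 = train_autoencoder_layer(codes1, hidden_dim=2)
features = encode(layer2, codes1)  # intermediate representation used as features
```

Each call to `train_autoencoder_layer` sees only the output of the layer below it, which is what "one layer at a time" amounts to in this sketch.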

“…By presenting the same data at the input and the output of the network while constraining intermediate connections, the network is trained to find an internal representation that is useful for reconstruction. These internal representations can be useful as features [36–41]. Like BNFs, autoencoders can be trained on languages different from the target language (often resulting in more data to train on).…”

Section: Autoencoder Features (mentioning)

confidence: 99%

ASR-Free CNN-DTW Keyword Spotting Using Multilingual Bottleneck Features for Almost Zero-Resource Languages

Menon, Kamper, Yılmaz et al. 2018

6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018)

We consider multilingual bottleneck features (BNFs) for nearly zero-resource keyword spotting. This forms part of a United Nations effort using keyword spotting to support humanitarian relief programmes in parts of Africa where languages are severely under-resourced. We use 1920 isolated keywords (40 types, 34 minutes) as exemplars for dynamic time warping (DTW) template matching, which is performed on a much larger body of untranscribed speech. These DTW costs are used as targets for a convolutional neural network (CNN) keyword spotter, giving a much faster system than direct DTW. Here we consider how available data from well-resourced languages can improve this CNN-DTW approach. We show that multilingual BNFs trained on ten languages improve the area under the ROC curve of a CNN-DTW system by 10.9% absolute relative to the MFCC baseline. By combining low-resource DTW-based supervision with information from well-resourced languages, CNN-DTW is a competitive option for low-resource keyword spotting.
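The DTW template matching mentioned in this abstract can be illustrated with a minimal sketch. The Euclidean frame distance, the three allowed step directions, and the normalisation by the summed sequence lengths are assumptions made for this example, not the configuration of the cited system. The function name `dtw_cost` is hypothetical.

```python
import math


def dtw_cost(query, segment):
    """Length-normalised DTW alignment cost between two sequences of feature frames."""
    n, m = len(query), len(segment)
    inf = float("inf")
    # acc[i][j] holds the cheapest cost of aligning query[:i] with segment[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(query[i - 1], segment[j - 1])  # Euclidean frame distance
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)


# A keyword template and a time-stretched copy of it align at zero cost,
# while an unrelated segment does not.
template = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [0.0], [1.0], [2.0], [2.0]]
unrelated = [[5.0], [5.0], [5.0]]
same_cost = dtw_cost(template, stretched)
other_cost = dtw_cost(template, unrelated)
```

In a setup like the one described, costs of this kind, computed between keyword exemplars and windows of untranscribed speech, would serve as the training targets for the faster CNN keyword spotter.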

“…Features should ideally disregard irrelevant information (such as speaker and gender), while capturing linguistically meaningful contrasts (such as phone or word categories). Several different unsupervised frame-level acoustic feature learning methods have been developed over the last few years [6]-[12], with neural networks being used in a number of studies [13]-[17].…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Feature Learning for Speech Using Correspondence and Siamese Networks

Engelbrecht, Kamper 2020

IEEE Signal Process. Lett.

In zero-resource settings where transcribed speech audio is unavailable, unsupervised feature learning is essential for downstream speech processing tasks. Here we compare two recent methods for frame-level acoustic feature learning. For both methods, unsupervised term discovery is used to find pairs of word examples of the same unknown type. Dynamic programming is then used to align the feature frames between each word pair, serving as weak top-down supervision for the two models. For the correspondence autoencoder (CAE), matching frames are presented as input-output pairs. The Triamese network uses a contrastive loss to reduce the distance between frames of the same predicted word type while increasing the distance between negative examples. For the first time, these feature extractors are compared on the same discrimination tasks using the same weak supervision pairs. We find that, on the two datasets considered here, the CAE outperforms the Triamese network. However, we show that a new hybrid correspondence-Triamese approach (CTriamese), consistently outperforms both the CAE and Triamese models in terms of average precision and ABX error rates on both English and Xitsonga evaluation data.
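The contrastive objective described in this abstract, pulling frames of the same predicted word type together while pushing negative examples apart, can be sketched with a margin-based triplet loss. The cosine distance, the hinge form, and the margin value of 0.15 are assumptions chosen for illustration and are not confirmed by the abstract. The function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_hinge_loss(anchor, positive, negative, margin=0.15):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(
        0.0,
        margin + cosine_distance(anchor, positive) - cosine_distance(anchor, negative),
    )


# Embeddings of three frames (toy values): the first two come from the same
# predicted word type, the third from a different one.
anchor, positive, negative = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
good_loss = triplet_hinge_loss(anchor, positive, negative)  # already well separated
bad_loss = triplet_hinge_loss(anchor, negative, positive)   # roles swapped, so penalised
```

Minimising such a loss over many aligned frame pairs is one way to realise the "reduce the distance / increase the distance" behaviour the abstract describes.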
