Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection

Chen, Hongjie; Leung, Cheung-Chi; Xie, Lei; Ma, Bin; Li, Haizhou

doi:10.21437/interspeech.2016-313

Cited by 41 publications

(28 citation statements)

References 19 publications


“…As in [3,6,10], three different evaluation metrics are used for QbE speech search: 1) mean average precision (MAP), which is the mean of average precision for each query on search content. 2) Precision of the top N utterances in the test set (P@N), where N is the number of target utterances involving the query term.…”

Section: Methods (mentioning)

confidence: 99%

Learning Acoustic Word Embeddings with Temporal Context for Query-by-Example Speech Search

et al. 2018

Self Cite


We propose to learn acoustic word embeddings with temporal context for query-by-example (QbE) speech search. The temporal context includes the leading and trailing word sequences of a word. We assume that there exist spoken word pairs in the training database. We pad the word pairs with their original temporal context to form fixed-length speech segment pairs. We obtain the acoustic word embeddings through a deep convolutional neural network (CNN) which is trained on the speech segment pairs with a triplet loss. By shifting a fixed-length analysis window through the search content, we obtain a running sequence of embeddings. In this way, searching for the spoken query is equivalent to the matching of acoustic word embeddings. The experiments show that our proposed acoustic word embeddings learned with temporal context are effective in QbE speech search. They outperform the state-of-the-art frame-level feature representations and reduce run-time computation since no dynamic time warping is required in QbE speech search. We also find that it is important to have sufficient speech segment pairs to train the deep CNN for effective acoustic word embeddings. Index Terms: acoustic word embeddings, word pairs, temporal context, triplet loss, query-by-example spoken term detection
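The two metrics spelled out in the citation statement above, MAP and P@N, can be illustrated with a minimal sketch. It assumes each query's search result is given as a list of relevance flags in ranked order covering the whole test set, so the number of relevant hits in the list equals N. The function names are hypothetical and are not taken from the cited papers. The third metric is elided in the quoted statement and is not covered here.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision for one query.

    ranked_relevance[i] is True when the utterance at rank i+1 contains
    the query term. Assumes the list covers the whole test set.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query: Sequence[Sequence[bool]]) -> float:
    """MAP: the mean of the per-query average precision values."""
    return sum(average_precision(r) for r in per_query) / len(per_query)


def precision_at_n(ranked_relevance: Sequence[bool]) -> float:
    """P@N, where N is the number of target utterances for the query."""
    n = sum(ranked_relevance)
    if n == 0:
        return 0.0
    return sum(ranked_relevance[:n]) / n


if __name__ == "__main__":
    # Two toy queries with made-up rankings, for illustration only.
    results = [[True, False, True, False], [False, True]]
    print(mean_average_precision(results))
    print([precision_at_n(r) for r in results])
```

Under these assumptions, a query whose targets all rank first scores 1.0 on both metrics; P@N only looks at the top N positions, while average precision also penalizes targets that rank far down the list.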


“…Regarding the features used for query/utterance representation, [5,13–15] employ Gaussian posteriorgrams; [16] proposes an i-vector-based approach for feature extraction; [17] uses phone log-likelihood ratio-based features; [18] employs posteriorgrams derived from various unsupervised tokenizers, supervised tokenizers, and semi-supervised tokenizers; [19] employs posteriorgrams derived from a Gaussian mixture model (GMM) tokenizer, phoneme recognition, and acoustic segment modelling; [11,15,20–26] use phoneme posteriorgrams; [11,27–29] employ bottleneck features; [30] employs posteriorgrams from non-parametric Bayesian models; [31] employs articulatory class-based posteriorgrams; [32] proposes an intrinsic spectral analysis; and [33] is based on the unsupervised segment-based bag of an acoustic words framework. All these studies employ the standard DTW algorithm for query search, except for [13], which employs the NS-DTW algorithm, [15,24,25,28,30], which employ the subsequence DTW (S-DTW) algorithm, [14], which presents a variant of the S-DTW algorithm, and [26], which employs the segmental DTW algorithm.…”

Section: Methods Based On Template Matching Of Features (mentioning)

confidence: 99%

ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation

Tejedor, Toledano, López-Otero, et al. 2018

J AUDIO SPEECH MUSIC PROC.


Query-by-example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given an acoustic (spoken) query containing the term of interest as the input. This paper presents the systems submitted to the ALBAYZIN QbE STD 2016 Evaluation held as a part of the ALBAYZIN 2016 Evaluation Campaign at the IberSPEECH 2016 conference. Special attention was given to the evaluation design so that a thorough post-analysis of the main results could be carried out. Two different Spanish speech databases, which cover different acoustic and language domains, were used in the evaluation: the MAVIR database, which consists of a set of talks from workshops, and the EPIC database, which consists of a set of European Parliament sessions in Spanish. We present the evaluation design, both databases, the evaluation metric, the systems submitted to the evaluation, the results, and a thorough analysis and discussion. Four different research groups participated in the evaluation, and a total of eight template matching-based systems were submitted. We compare the systems submitted to the evaluation and make an in-depth analysis based on some properties of the spoken queries, such as query length, single-word/multi-word queries, and in-language/out-of-language queries.


“…UAM is a challenging problem with significant practical impact in speech as well as linguistics and cognitive science communities. It has been studied in applications such as ASR for low-resource languages [1], language identification [2] and query-by-example spoken term detection [3]. This problem is also relevant to endangered language protection [4] and understanding infants' language acquisition mechanism [5].…”

Section: Introduction (mentioning)

confidence: 99%

Improving Unsupervised Subword Modeling via Disentangled Speech Representation Learning and Transformation

Feng, Lee 2019

Interspeech 2019


This study tackles unsupervised subword modeling in the zero-resource scenario, learning frame-level speech representation that is phonetically discriminative and speaker-invariant, using only untranscribed speech for target languages. Frame label acquisition is an essential step in solving this problem. High quality frame labels should be in good consistency with golden transcriptions and robust to speaker variation. We propose to improve frame label acquisition in our previously adopted deep neural network-bottleneck feature (DNN-BNF) architecture by applying the factorized hierarchical variational autoencoder (FHVAE). FHVAEs learn to disentangle linguistic content and speaker identity information encoded in speech. By discarding or unifying speaker information, speaker-invariant features are learned and fed as inputs to DPGMM frame clustering and DNN-BNF training. Experiments conducted on ZeroSpeech 2017 show that our proposed approaches achieve 2.4% and 0.6% absolute ABX error rate reductions in across- and within-speaker conditions, comparing to the baseline DNN-BNF system without applying FHVAEs. Our proposed approaches significantly outperform vocal tract length normalization in improving frame labeling and subword modeling.
