High-performance Query-by-Example Spoken Term Detection on the SWS 2013 evaluation

Rodríguez-Fuentes, Luis Javier; Varona, Amparo; Peñagarikano, Mikel; Bordel, Germán; Díez, Mireia

doi:10.1109/icassp.2014.6855122

Cited by 67 publications

(80 citation statements)

References 15 publications

“…For example, dimensionality reduction can be applied to reduce the stacked feature vector. We will also compare our method with the latest DTW QbyE systems, as described in [25,27].…”

Section: Discussion (mentioning)

confidence: 99%

Query-by-example keyword spotting using long short-term memory networks

Chen, Parada, Sainath

2015

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

135

We present a novel approach to query-by-example keyword spotting (KWS) using a long short-term memory (LSTM) recurrent neural network-based feature extractor. In our approach, we represent each keyword using a fixed-length feature vector obtained by running the keyword audio through a word-based LSTM acoustic model. We use the activations prior to the softmax layer of the LSTM as our keyword-vector. At runtime, we detect the keyword by extracting the same feature vector from a sliding window and computing a simple similarity score between this test vector and the keyword vector. With clean speech, we achieve 86% relative false rejection rate reduction at 0.5% false alarm rate when compared to a competitive phoneme posteriorgram with dynamic time warping KWS system, while the reduction in the presence of babble noise is 67%. Our system has a small memory footprint, low computational cost, and high precision, making it suitable for on-device applications.
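The runtime detection step described in this abstract (extract a fixed-length vector from each sliding window, then score it against the keyword vector) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the similarity measure, so cosine similarity is assumed here, and `mean_embed` is a hypothetical stand-in for the LSTM feature extractor, not the authors' model. The names `detect`, `window`, and `threshold` are likewise illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_embed(window_frames):
    """Hypothetical stand-in for the LSTM extractor: average the frames in the window."""
    dim = len(window_frames[0])
    count = len(window_frames)
    return [sum(frame[k] for frame in window_frames) / count for k in range(dim)]


def detect(frames, keyword_vec, window, threshold, embed=mean_embed):
    """Slide a fixed-size window over the frames and return (start, score) pairs
    whose similarity to the keyword vector reaches the threshold."""
    hits = []
    for start in range(len(frames) - window + 1):
        test_vec = embed(frames[start:start + window])
        score = cosine(test_vec, keyword_vec)
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy usage: only the last window is dominated by the "keyword" direction [0, 1].
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
hits = detect(frames, keyword_vec=[0.0, 1.0], window=2, threshold=0.9)
```

In a real system the embedding function would be replaced by the network's pre-softmax activations, and the threshold would be tuned to the target false alarm rate.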

“…In our experiments we ignore this effect and simply choose the first template randomly. Another option is to choose the longest template as the first one, as proposed in [25]. Table 1 lists keywords used in our experiments.…”

Section: Template Averaging (mentioning)

confidence: 99%

“…Nevertheless, exemplar-based speech processing faces two fundamental problems: (1) The growing size of the databases prohibits efficient search, and (2) The duration variation in speech pronunciation is effectively handled via dynamic time warping that is computationally expensive and sub-optimal due to dependency on the local reference exemplar. This paper addresses these limitations to foster exemplar based solutions for real time applications.…”

Section: State-of-the-art Solutions and Challenges (mentioning)

confidence: 99%

“…QbE-STD received serious consideration in the context of MediaEval spoken query search benchmarking campaign [1,2,3]. Recent exemplar based speech processing offers high flexibility in speech applications, partly attributed to the lack of complex statistical assumptions that facilitate exploiting "data deluge" with no prejudice on expected answers.…”

Section: State-of-the-art Solutions and Challenges (mentioning)

confidence: 99%

Phonological Posterior Hashing for Query by Example Spoken Term Detection

Asaei, Ram, Bourlard

2018

Interspeech 2018

State-of-the-art query-by-example spoken term detection (QbE-STD) systems in zero-resource conditions rely on representing speech as sequences of class-conditional posterior probabilities estimated by a deep neural network (DNN). The posteriors are often used for pattern matching or dynamic time warping (DTW). Exploiting posterior probabilities as a speech representation offers diverse advantages in a classification system. One key property of the posterior representations is that they admit a highly effective hashing strategy that enables indexing a large audio archive in divisions to reduce the search complexity. Moreover, posterior indexing leads to a compressed representation and enables pronunciation dewarping and partial detection with no need for DTW. We exploit these characteristics of the posterior space in the context of redundant hash addressing for QbE-STD. We evaluate the QbE-STD system on the AMI corpus and demonstrate that tremendous speedup and superior accuracy are achieved compared to the state-of-the-art pattern matching solution based on DTW. The system has the potential to enable massively large-scale spoken query detection.

“…The DTW algorithm is a dynamic programming technique to compute the distance between two sequences of spectral vectors of arbitrary length, and is commonly applied in query-by-example spoken term detection and other data mining tasks (Rodriguez-Fuentes et al., 2014; Keogh and Ratanamahatana, 2005). Being a non-parametric approach, it is well-suited for limited- or zero-resource tasks (Versteegh et al., 2015).…”

Section: DTW System (mentioning)

confidence: 99%
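
The dynamic programming recurrence that the statement above refers to can be illustrated with a minimal textbook sketch of DTW. It is not the system from any of the cited papers: Euclidean distance as the local cost, the three-way step pattern, and the absence of path-length normalization or subsequence matching are all simplifying assumptions, and `dtw_distance` is a name chosen here for illustration.

```python
import math


def dtw_distance(a, b):
    """Cumulative DTW cost between two sequences of equal-dimension vectors.

    Assumptions: Euclidean local distance, steps (i-1, j), (i, j-1), (i-1, j-1),
    both sequences non-empty, no normalization by path length.
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of a repeated against the same frame of b
                cost[i][j - 1],      # frame of b repeated against the same frame of a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# A time-stretched copy of a sequence aligns with zero cost.
query = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [1.0], [1.0], [2.0]]
zero_cost = dtw_distance(query, stretched)
```

The table has `(n + 1) * (m + 1)` cells, which is the quadratic cost that the statements quoted earlier describe as computationally expensive for large archives.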