2009 IEEE Workshop on Automatic Speech Recognition & Understanding
DOI: 10.1109/asru.2009.5372889
Query-by-example spoken term detection using phonetic posteriorgram templates

Cited by 242 publications (175 citation statements)
References 9 publications
“…Like the phonetic posteriorgrams used in [11,12], a supervised or semi-supervised DBM posteriorgram is a probability vector representing the posterior probabilities of a set of labeled phonetic units for a speech frame. Formally, if we denote N speech frames as x_1, .…”

Section: Semi-supervised DBM Posteriorgram (mentioning, confidence: 99%)
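The posteriorgram described in the excerpt above can be sketched in a few lines; this is a minimal illustration assuming per-frame classifier logits are available (the `logits` array is toy data, not output from any of the cited systems):

```python
import numpy as np

def posteriorgram(logits):
    """Turn per-frame classifier logits (T x P) into a phonetic
    posteriorgram: each row becomes a posterior distribution over
    the P phonetic units via a softmax."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy data: 3 speech frames scored against 4 hypothetical phone classes.
logits = np.array([[2.0, 0.1, -1.0, 0.3],
                   [0.0, 3.0, 0.2, -0.5],
                   [1.0, 1.0, 1.0, 1.0]])
pg = posteriorgram(logits)  # each row sums to 1
```

Stacking such rows over the frames x_1, …, x_N gives the probability-vector sequence the excerpt refers to.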
“…In this case, the problem is known as Query-by-Example Spoken Term Detection (QbE-STD), and both the input query and the collection of documents are acoustic signals. Both of these tasks have been studied recently [1][2][3][4], and examples of their interest and importance are the evaluation campaigns carried out along these lines, such as the one organized by NIST in 2006 [5] and the MediaEval evaluations [6]. In this paper, we focus on the Query-by-Example Spoken Term Detection task.…”

Section: Introduction (mentioning, confidence: 99%)
“…The feature vectors are usually a standard parametrization of the acoustic signal, for example based on cepstral coefficients. Also, in the recent literature, one of the most common algorithms used to perform this search is Segmental Dynamic Time Warping (SDTW) [1][2][3][4].…”

Section: Introduction (mentioning, confidence: 99%)
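The DTW search mentioned above aligns a query posteriorgram against document frames. A minimal sketch of plain DTW follows (not the segmental variant; the `neglog_dot` frame distance is one common choice for comparing posterior vectors, not necessarily the one used in every cited work):

```python
import numpy as np

def neglog_dot(p, q, eps=1e-10):
    """Frame-level distance between two posterior vectors: negative
    log of their inner product (a common choice for posteriorgrams)."""
    return -np.log(max(float(np.dot(p, q)), eps))

def dtw_distance(Q, D, dist=neglog_dot):
    """Accumulated cost of the best DTW alignment between a query
    posteriorgram Q (m frames) and a document posteriorgram D (n frames)."""
    m, n = len(Q), len(D)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = dist(Q[i - 1], D[j - 1])
            # Standard step pattern: vertical, horizontal, or diagonal move.
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[m, n]

# Toy posteriorgrams over 2 phone classes.
Q = np.array([[0.9, 0.1], [0.1, 0.9]])
D_match = np.array([[0.9, 0.1], [0.1, 0.9]])
D_mismatch = np.array([[0.1, 0.9], [0.9, 0.1]])
print(dtw_distance(Q, D_match) < dtw_distance(Q, D_mismatch))  # True
```

A segmental variant would additionally restart this alignment at successive offsets of the document with a band constraint, so that short queries can match subsequences of a long utterance.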
“…Our method only requires a forward pass computation of the neural network, followed by a vector distance computation, and therefore is more efficient than [15] where an LVCSR is involved and [17] where multiple DTW computations are necessary. It also requires less computation than [18,19] since vector distance is used instead of DTW.…”
Section: Introduction (mentioning, confidence: 99%)
“…In [17], a graph-based method is proposed to embed audio segments into a fixed-dimensional space, but dynamic time warping (DTW) is performed between the test audio segment and all the training segments in order to compute the embedding, which can be slow given a large number of training segments. In [18,19], Gaussian or phoneme posteriorgrams are generated as templates from example keywords, and DTW is used to compare the templates. Though this type of DTW-based method has well-known inadequacies [17], it is the most appropriate KWS baseline for our application.…”

Section: Introduction (mentioning, confidence: 99%)