Interspeech 2017
DOI: 10.21437/interspeech.2017-1087

A Relevance Score Estimation for Spoken Term Detection Based on RNN-Generated Pronunciation Embeddings

Abstract: In this paper, we present a novel method for term score estimation. The method is primarily designed for scoring out-of-vocabulary terms; however, it could also estimate scores for in-vocabulary results. The term score is computed as a cosine distance of two pronunciation embeddings. The first is generated from the grapheme representation of the searched term, while the second is computed from the recognized phoneme confusion network. The embeddings are generated by specifically trained recurrent ne…
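The scoring step named in the abstract, a cosine distance between two embeddings, can be illustrated with a minimal sketch. The vectors below are made-up placeholders, not outputs of the paper's recurrent networks, and the function names are hypothetical. The abstract is truncated, so how the paper maps the distance to a final relevance score is not shown here; the sketch assumes the common convention of distance = 1 - cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed convention: 1 - cosine similarity (0 means same direction)."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical embeddings (illustrative values only).
term_embedding = [0.2, 0.7, 0.1]      # stands in for the grapheme-side embedding
hit_embedding = [0.25, 0.65, 0.05]    # stands in for a matching phoneme-side embedding
off_embedding = [0.9, -0.1, 0.3]      # stands in for a non-matching segment

close = cosine_distance(term_embedding, hit_embedding)
far = cosine_distance(term_embedding, off_embedding)
print(f"matching segment distance:     {close:.3f}")
print(f"non-matching segment distance: {far:.3f}")
```

Under this convention, a smaller distance indicates a closer match between the searched term and the recognized segment.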

Citations: cited by 7 publications (13 citation statements)
References: 14 publications (19 reference statements)
“…Since we are targeting large spoken archives for which an LVCSR system exists and is used for searching the IV terms, we focused on methods where the STD is performed using a phoneme recognizer (in a structure similar to LVCSR) to search the OOV terms. The idea is not new: we used it in a mostly heuristic search presented in [2], and subsequently we adopted the approach of Siamese networks [17] to robustly estimate the term relevance scores. In this work we modified the Siamese architecture presented in [17] with the goal of simplifying the network architecture and further improving the STD performance: STD process - while the original Siamese architecture was proposed for relevance score estimation only and the localization of terms was performed using the index of phoneme triplets, the proposed approach allows both localizing and scoring the putative hits of the searched term.…”
Section: Introduction (mentioning)
confidence: 99%
“…Subword-based ASR, on the other hand, can detect terms even though they are not in the vocabulary of the recognizer (i.e., out-of-vocabulary (OOV) terms). A more robust system could be obtained by combining both approaches [17,24,25,38,39,42,50,67-73].…”
Section: Spoken Term Detection (mentioning)
confidence: 99%
“…In this direction, several end-to-end ASR-free approaches for STD were proposed [13,34-36]. In addition to exploring neural end-to-end approaches, deep learning is extensively used to extract representations (embeddings) of audio documents and query terms that facilitate the search [20,21,23,25].…”
Section: Spoken Term Detection (mentioning)
confidence: 99%
“…On the other hand, the subword-based approach has the unique advantage that it can detect terms that consist of words that are not in the vocabulary of the recognizer, i.e., out-of-vocabulary (OOV) terms. The combination of these two approaches has been proposed in order to exploit the relative advantages of word and subword-based strategies [17,32,33,36,44,63-70].…”
Section: Spoken Term Detection Overview (mentioning)
confidence: 99%