Interspeech 2018 2018
DOI: 10.21437/interspeech.2018-1010
|View full text |Cite
|
Sign up to set email alerts
|

Learning Acoustic Word Embeddings with Temporal Context for Query-by-Example Speech Search

Abstract: We propose to learn acoustic word embeddings with temporal context for query-by-example (QbE) speech search. The temporal context includes the leading and trailing word sequences of a word. We assume that there exist spoken word pairs in the training database. We pad the word pairs with their original temporal context to form fixed-length speech segment pairs. We obtain the acoustic word embeddings through a deep convolutional neural network (CNN) which is trained on the speech segment pairs with a triplet los… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
2
1

Citation Types

0
27
0

Year Published

2019
2019
2024
2024

Publication Types

Select...
5
4

Relationship

0
9

Authors

Journals

citations
Cited by 31 publications
(27 citation statements)
references
References 22 publications
(40 reference statements)
0
27
0
Order By: Relevance
“…Several supervised and unsupervised acoustic embedding methods have been proposed. Supervised methods include convolutional [11][12][13] and recurrent neural network (RNN) models [14][15][16][17], trained with discriminative classification and contrastive losses. Unsupervised methods include using distances to a fixed reference set [10] and unsupervised autoencoding RNNs [18][19][20].…”
Section: Introductionmentioning
confidence: 99%
“…Several supervised and unsupervised acoustic embedding methods have been proposed. Supervised methods include convolutional [11][12][13] and recurrent neural network (RNN) models [14][15][16][17], trained with discriminative classification and contrastive losses. Unsupervised methods include using distances to a fixed reference set [10] and unsupervised autoencoding RNNs [18][19][20].…”
Section: Introductionmentioning
confidence: 99%
“…A less direct approach consists of replicating the standard approach of natural language processing (NLP) of representing a word with a fixed-length vector (embedding). In [120,[132][133][134], this is extended by obtaining the word embedding directly from the audio. Once the embeddings are obtained, matching words is trivial and can be done using nearest neighbours [132].…”
Section: Query-by-example Spoken Term Detectionmentioning
confidence: 99%
“…We use the same model in this work and expand it towards learning acoustic embeddings. [5,6,8,7,26,9] all explore ways to learn acoustic word embeddings. All above methods except [7] use unsupervised learning based methods to obtain these embeddings where they do not use the transcripts or do not perform speech recognition.…”
Section: Related Workmentioning
confidence: 99%