2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016
DOI: 10.1109/icassp.2016.7472619

Deep convolutional acoustic word embeddings using word-pair side information

Abstract: Recent studies have been revisiting whole words as the basic modelling unit in speech recognition and query applications, instead of phonetic units. Such whole-word segmental systems rely on a function that maps a variable-length speech segment to a vector in a fixed-dimensional space; the resulting acoustic word embeddings need to allow for accurate discrimination between different word types, directly in the embedding space. We compare several old and new approaches in a word discrimination task. Our best ap…
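To make the idea of mapping a variable-length segment to a fixed-dimensional vector concrete, here is a minimal sketch. It assumes uniform frame downsampling as the mapping; the abstract does not say which mappings the paper compares. The function name `downsample_embedding`, the parameter `k`, and the 13-dimensional frames in the usage note are hypothetical.

```python
from typing import List, Sequence


def downsample_embedding(frames: Sequence[Sequence[float]], k: int = 10) -> List[float]:
    """Map a variable-length sequence of feature frames to a fixed-length vector.

    Assumption (not from the source): the mapping is uniform downsampling,
    i.e. keep k evenly spaced frames and concatenate them.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not frames:
        raise ValueError("segment must contain at least one frame")
    n = len(frames)
    # Pick k frame indices spread evenly over the segment; frames repeat
    # when the segment is shorter than k.
    indices = [min(n - 1, (i * n) // k) for i in range(k)]
    # Concatenate the selected frames into one flat vector of size k * dim.
    return [value for i in indices for value in frames[i]]
```

With this sketch, a 25-frame segment and a 7-frame segment of 13-dimensional features both map to vectors of length 130, so they can be compared directly.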

Citations: cited by 149 publications (174 citation statements)
References: 29 publications
“…Several supervised and unsupervised acoustic embedding methods have been proposed. Supervised methods include convolutional [11][12][13] and recurrent neural network (RNN) models [14][15][16][17], trained with discriminative classification and contrastive losses. Unsupervised methods include using distances to a fixed reference set [10] and unsupervised autoencoding RNNs [18][19][20].…”
Section: Introduction (mentioning)
confidence: 99%
“…Before the experiment, we implemented the baseline multiview approach [13] and trained it with the model and dataset provided by the authors to verify the performance improvement compared to the single-view approaches [9,10]. Then we established our initial model parameters as the same with the retuned baseline model on the WSJ dataset.…”
Section: Methods (mentioning)
confidence: 99%
“…The first task is acoustic word discrimination, where we are given two word segments to determine whether they match or not. This task is equivalent to the objective of the single-view approach and has been used in prior papers [9,10,11,12,14,17]. We regard this task as our main evaluation task for training the proposed and baseline network architectures.…”
Section: Evaluation Tasks (mentioning)
confidence: 99%
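The word discrimination task described in the statement above (given two word segments, decide whether they match) can be sketched as follows. This assumes each segment has already been embedded as a fixed-dimensional vector and that the decision is made by thresholding cosine distance. The function names and the threshold value are hypothetical and are not taken from the cited papers.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def same_word(a: Sequence[float], b: Sequence[float], threshold: float = 0.5) -> bool:
    """Decide that two segments match if their embeddings are close enough.

    The threshold of 0.5 is an illustrative assumption; in practice it would
    be swept over a range to trade off precision against recall.
    """
    return cosine_distance(a, b) <= threshold
```

Sweeping `threshold` over all pairs in a test set would give a precision-recall curve, which is one common way to summarise a match/no-match task of this kind.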
“…Moreover, we will also investigate the use of different types of acoustic embeddings, such as those derived from siamese networks [24], that try to preserve distance of words both semantically and in acoustic space.…”
Section: Discussion (mentioning)
confidence: 99%