2020
DOI: 10.1109/access.2020.2999055
Learning Acoustic Word Embeddings With Dynamic Time Warping Triplet Networks

Abstract: In recent years, acoustic word embeddings (AWEs) have gained significant interest in the research community, particularly for Query-by-Example Spoken Term Detection (QbE-STD) search and related word discrimination tasks. It has been shown that AWEs learned for word or phone classification in one or several languages can outperform approaches that use dynamic time warping (DTW). In this paper, a new method of learning AWEs in the DTW framework is pro…
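For readers unfamiliar with the DTW baseline the abstract refers to, the following is a minimal, illustrative sketch of dynamic time warping between two sequences of feature vectors (e.g. MFCC frames). It is not the paper's method — the paper embeds DTW inside triplet-network training — just the classic cumulative-cost recursion:

```python
# Minimal dynamic time warping (DTW) sketch: computes the cumulative
# alignment cost between two sequences of equal-dimension feature
# vectors, using Euclidean frame distance and the standard
# match/insertion/deletion recursion. Illustrative only.

def dtw_distance(seq_a, seq_b):
    """Cumulative DTW cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j] = best alignment cost of seq_a[:i] against seq_b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance: Euclidean distance between frame vectors.
            d = sum((a - b) ** 2
                    for a, b in zip(seq_a[i - 1], seq_b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

Identical sequences align along the diagonal at zero cost; DTW's value for speech is that it also aligns sequences of different lengths, absorbing variation in speaking rate.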

Cited by 4 publications (4 citation statements)
References 24 publications
“…Additionally, the study proposes to extract acoustic features using the acoustic word embedding (AWE) model [22]. This model was trained to discriminate between different words and allows for compact encoding of acoustics while preserving contextual information of the input.…”
Section: Contributions
confidence: 99%
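The "trained to discriminate between different words" objective mentioned in this excerpt is typically realized with a triplet margin loss. A minimal sketch, assuming precomputed embedding distances (the function name and margin value are illustrative, not from the paper):

```python
# Triplet margin loss sketch: encourages an anchor embedding to lie
# closer to a same-word (positive) example than to a different-word
# (negative) example by at least `margin`. Inputs are scalar distances.

def triplet_margin_loss(d_anchor_pos, d_anchor_neg, margin=0.5):
    """Hinge loss on the gap between positive and negative distances."""
    return max(0.0, margin + d_anchor_pos - d_anchor_neg)
```

The loss is zero once the negative example is farther than the positive by the margin, so training focuses on triplets that still violate the constraint.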
“…To address the problem of inter and intra-speaker variability of speech representation, the acoustic word embedding (AWE) model [22] was employed to obtain embeddings from MFCCs. These embeddings were then used as an acoustic representation of speech in the learning algorithm.…”
Section: Acoustic Representation
confidence: 99%
“…We used deep network embeddings as inputs to the unsupervised clustering module. Network embeddings extracted from pre-trained CNN models have been shown to provide excellent performance in the unsupervised classification of natural images [41] and speech synthesis [42], outperforming classical image features.…”
Section: Feature Extraction
confidence: 99%