2021
DOI: 10.1007/978-3-030-87802-3_69

Learning Efficient Representations for Keyword Spotting with Triplet Loss

Abstract: In the past few years, triplet loss-based metric embeddings have become a de facto standard for several important computer vision problems, most notably person re-identification. In speech recognition, by contrast, the metric embeddings generated by the triplet loss are rarely used, even for classification problems. We fill this gap by showing that a combination of two representation learning techniques: a triplet loss-based embedding and a variant of kNN for classification instead of cross-entrop…
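
As an illustration of the two techniques the abstract names, the following is a minimal sketch in NumPy of a triplet loss over embedding vectors and a nearest-neighbour classifier over a bank of labelled embeddings. It is not the authors' implementation: the abstract only says "a variant of kNN", so plain majority-vote kNN is swapped in here, and the function names `triplet_loss` and `knn_predict`, the Euclidean distance, and the defaults `margin=0.5` and `k=3` are all assumptions made for the example.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Mean hinge-style triplet loss over a batch of embeddings.

    Assumption: Euclidean distance and margin=0.5 are illustrative choices,
    not values taken from the paper.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # anchor to same-class sample
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # anchor to other-class sample
    # Penalise triplets where the positive is not closer than the negative by `margin`.
    return np.maximum(0.0, d_pos - d_neg + margin).mean()


def knn_predict(query, bank, bank_labels, k=3):
    """Plain majority-vote kNN (an assumed stand-in for the paper's kNN variant)."""
    dists = np.linalg.norm(bank - query, axis=1)  # distance to every stored embedding
    nearest = np.argsort(dists)[:k]               # indices of the k closest embeddings
    labels, counts = np.unique(bank_labels[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # most frequent label among neighbours


# Toy 2-D embeddings: three points of class 0 near the origin, two of class 1 near (5, 5).
bank = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
bank_labels = np.array([0, 0, 0, 1, 1])
predicted = knn_predict(np.array([0.05, 0.05]), bank, bank_labels)
```

In this arrangement the embedding network would be trained with the loss, and classification at test time would use only distances in the learned space rather than a softmax layer.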

Citations: cited by 32 publications (12 citation statements)
References: 53 publications
“…(%) Model SCV2 Acc. (%) PANN [11] 90.5 RES-15 [21] 97.0 AST [14] 95.6 ± 0.4 AST [14] 98.1 ± 0.05 ERANN [22] 96. which is consistent to AST. Finally, we set 4 network groups with 2, 2, 6, 2 swin-transformer blocks respectively.…”
Section: Model
Citation type: mentioning
confidence: 99%
“…Inspired by [37] and [38], EdgeCRNN [39] was proposed, an edge-computing oriented model of acoustic feature enhancement for keyword spotting. Recently, [40] combined a triplet loss-based embedding and a variant of K-Nearest Neighbor (KNN) for classification. We also evaluated our speech augmentation based unsupervised learning method on this dataset, and compared with other unsupervised approaches, including CPC [23], APC [24] and MPC [25].…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…"yes", "up", "stop") and the task is to classify these in a 12- or 35-class setting. The dataset comes pre-partitioned into 35 classes and, in order to obtain the 12-class version, the standard approach [9,20,71] is to keep 10 classes of interest (i.e. "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"), place the remaining 25 under the "unknown" class, and introduce a new class "silence" where no spoken word appears in the audio clip.…”
Section: Detailed Experimental Setup
Citation type: mentioning
confidence: 99%
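
The 35-to-12 class reduction described in the statement above can be sketched as a simple label mapping. This is a hedged illustration only: the names `TARGET_WORDS`, `CLASSES_12`, and `to_12_class` are hypothetical, and representing a speech-free clip as `None` is an assumption, since the statement does not say how "silence" clips are produced or stored.

```python
# The 10 classes of interest listed in the citation statement.
TARGET_WORDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

# Resulting 12-class label set: 10 target words plus "unknown" and "silence".
CLASSES_12 = TARGET_WORDS + ["unknown", "silence"]


def to_12_class(word):
    """Map one of the 35 original labels to the 12-class setting.

    Assumption: a clip with no spoken word is passed in as None.
    """
    if word is None:
        return "silence"
    # Any of the remaining 25 words collapses into the single "unknown" class.
    return word if word in TARGET_WORDS else "unknown"


# "bed" stands in for one of the 25 non-target words.
mapped = [to_12_class(w) for w in ["yes", "bed", None]]
```

Keeping the mapping as a pure function of the label makes it easy to apply the same reduction to the training, validation, and test partitions.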