Interspeech 2017
DOI: 10.21437/interspeech.2017-480
Compressed Time Delay Neural Network for Small-Footprint Keyword Spotting

Abstract: In this paper we investigate a time delay neural network (TDNN) for a keyword spotting task that requires low CPU, memory and latency. The TDNN is trained with transfer learning and multi-task learning. Temporal subsampling enabled by the time delay architecture reduces computational complexity. We propose to apply singular value decomposition (SVD) to further reduce TDNN complexity. This allows us to first train a larger full-rank TDNN model which is not limited by CPU/memory constraints. The larger TDNN usua…

Cited by 107 publications (69 citation statements)
References 34 publications
“…It is widely used for hands-free control of mobile applications. Since its use is commonly concentrated on recognizing wake-up words (e.g., "Hey Siri" [1], "Alexa" [2,3], and "Okay Google" [4]) or distinguishing common commands (e.g., "yes" or "no") on mobile devices, the response of KWS should be both immediate and accurate. However, it is challenging to implement fast and accurate KWS models that meet the real-time constraint on mobile devices with restricted hardware resources.…”
Section: Introduction
confidence: 99%
“…Eqs. (3) to (7) (with the number of positive targets n = 1) and Eqs. (8) to (9) define the loss for the decoder submodel. […] includes the actual end-point of the keyword.…”
Section: Smoothed Max Pooling Loss For Decoder
confidence: 99%
“…one-hot vector) without considering them as sequences of characters. Other recent works aim to spot specific keywords used to activate voice assistant systems [29,30,31]. The application of BiLSTMs on KWS was first proposed in [32].…”
Section: Related Work
confidence: 99%