Interspeech 2020
DOI: 10.21437/interspeech.2020-1003

Streaming Keyword Spotting on Mobile Devices

Abstract: In this work we explore the latency and accuracy of keyword spotting (KWS) models in streaming and non-streaming modes on mobile phones. Converting a neural network (NN) model from non-streaming mode (the model receives the whole input sequence and then returns the classification result) to streaming mode (the model receives a portion of the input sequence and classifies it incrementally) may require manual model rewriting. We address this by designing a TensorFlow/Keras-based library which allows automatic conversion of non-streaming mod…
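
The difference between the two modes described in the abstract can be illustrated with a minimal sketch. It is not the paper's library or its API: the names `conv1d_non_streaming`, `StreamingConv1D`, and `step` are hypothetical, and a single causal 1-D convolution in plain Python stands in for a full KWS model. The non-streaming function sees the whole sequence at once; the streaming class receives one sample per call and keeps the most recent samples as internal state, so both produce the same outputs.

```python
from collections import deque


def conv1d_non_streaming(signal, kernel):
    """Causal 1-D convolution over the whole input sequence at once."""
    k = len(kernel)
    # Left-pad with zeros so the output has the same length as the input.
    padded = [0.0] * (k - 1) + list(signal)
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(signal))
    ]


class StreamingConv1D:
    """Same convolution, fed one sample per call (hypothetical helper class)."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Ring buffer with the most recent len(kernel) samples, zero-initialised.
        self.state = deque([0.0] * len(self.kernel), maxlen=len(self.kernel))

    def step(self, sample):
        # Appending drops the oldest sample because of maxlen.
        self.state.append(sample)
        return sum(w * x for w, x in zip(self.kernel, self.state))


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.5, 0.25, 0.25]

full = conv1d_non_streaming(signal, kernel)   # needs the whole sequence
layer = StreamingConv1D(kernel)
streamed = [layer.step(x) for x in signal]    # one sample at a time

print(full == streamed)
```

In this sketch the only extra cost of streaming is the small state buffer; converting a real model would mean giving every layer with a temporal receptive field a comparable state.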

Cited by 85 publications (69 citation statements)
References: 21 publications
“…The searching and evaluation are proxyless either on V1 or V2 dataset. We refrain from using techniques like SpecAugment [27] and self-attention [28] to have a fair comparison with other prior arts. Notice that with these tricks, the performance can be further boosted.…”
Section: Searching Results (mentioning)
confidence: 99%
“…It is although interesting to see all three DARTS methods agree to have a skip connection in the last layer. We take NoisyDARTS model trained on V1 dataset and plot the ROC curve with false negative rate vs. false positive rate in Figure 4, in comparison with MHAtt-RNN and TC-ResNet-365K (boosted version with 365K parameters by [28]). It indicates that NoisyDARTS-TC-14 and MHAtt-RNN are close, while both outperforming TC-ResNet-365K.…”
Section: Searching Results (mentioning)
confidence: 99%
“…To this end, we extend the Streaming-aware Neural Network [2] to further support three types of operations: strided convolutions, transposed convolutions, and convolutions with shortcut connections. While the original framework already supports plain convolutions, this extension enables streaming convolutions for U-Net architectures.…”
Section: Real-time Processing Of Streaming Input (mentioning)
confidence: 99%
“…In addition, one of our main contributions is to explicitly address the problem of deploying SEANet on a mobile device, aiming at low latency. This is inspired by previous work on streaming architectures for keyword spotting [2], that we extended to support the operations needed to deploy a UNet generator. With this solution, we are able to process each 16ms audio frame in ∼1.5ms on the CPU of a mobile device, so that the total latency is ∼17.5 ms.…”
Section: Introduction (mentioning)
confidence: 99%
“…Commands to control applications and services include "play the music," "turn off," and "how is the weather tomorrow?" While the applicability of neural networks to KWS has been demonstrated, recent studies have pursued performance improvement and reduction in the number of parameters [18]- [20], and other studies have focused on improving the realtime KWS performance [12], [21].…”
Section: Introduction (mentioning)
confidence: 99%