Interspeech 2017
DOI: 10.21437/interspeech.2017-480
Compressed Time Delay Neural Network for Small-Footprint Keyword Spotting

Abstract: In this paper we investigate a time delay neural network (TDNN) for a keyword spotting task that requires low CPU, memory and latency. The TDNN is trained with transfer learning and multi-task learning. Temporal subsampling enabled by the time delay architecture reduces computational complexity. We propose to apply singular value decomposition (SVD) to further reduce TDNN complexity. This allows us to first train a larger full-rank TDNN model which is not limited by CPU/memory constraints. The larger TDNN usua…

Cited by 107 publications (69 citation statements)
References 34 publications
“…It is widely used for hands-free control of mobile applications. Since its use is commonly concentrated on recognizing wake-up words (e.g., "Hey Siri" [1], "Alexa" [2,3], and "Okay Google" [4]) or distinguishing common commands (e.g., "yes" or "no") on mobile devices, the response of KWS should be both immediate and accurate. However, it is challenging to implement fast and accurate KWS models that meet the real-time constraint on mobile devices with restricted hardware resources.…”
Section: Introduction
confidence: 99%
“…Eqs. (3) to (7) (with the number of positive targets n = 1) and Eqs. (8) to (9) define the loss for the decoder submodel. […] includes the actual end-point of the keyword.…”
Section: Smoothed Max Pooling Loss For Decoder
confidence: 99%
“…one-hot vector) without considering them as sequences of characters. Other recent works aim to spot specific keywords used to activate voice assistant systems [29,30,31]. The application of BiLSTMs on KWS was first proposed in [32].…”
Section: Related Work
confidence: 99%