Interspeech 2020
DOI: 10.21437/interspeech.2020-3132
Neural Architecture Search for Keyword Spotting

Abstract: Keyword spotting aims to identify specific keyword audio utterances. In recent years, deep convolutional neural networks have been widely utilized in keyword spotting systems. However, their model architectures are mainly based on off-the-shelf backbones such as VGG-Net or ResNet, rather than being specially designed for the task. In this paper, we utilize neural architecture search to design convolutional neural network models that can boost keyword spotting performance while maintaining an acceptable memory …

Cited by 33 publications (21 citation statements). References 41 publications.
“…In contrast, the vast majority of previous NAS research has been focused on computer vision applications [21,22,23]. Existing NAS works in the speech community investigated non-TDNN based architectures [24,25,26,27,28,29].…”
Section: Introduction
confidence: 99%
“…By controlling the standard deviation β of the injected noise, we can tune the search algorithm to trade off the number of skip connections against overall performance. NoisyDARTS finds the best model of all three methods on the V1 dataset, with on average nearly 8× fewer parameters than the contemporary work NAS2 [20]. This much-improved efficiency allows our models to be deployed on IoT devices with low computation cost.…”
Section: Searching Results
confidence: 99%
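The NoisyDARTS excerpt above describes injecting zero-mean Gaussian noise with standard deviation β into the skip-connection path during search, so that skip connections do not trivially dominate the architecture. A minimal sketch of that idea follows; the names `noisy_skip` and `beta` are illustrative assumptions, not the paper's API:

```python
import numpy as np

def noisy_skip(x, beta, rng=None):
    """Skip connection with additive zero-mean Gaussian noise.

    During architecture search, perturbing the identity (skip) path with
    noise of standard deviation `beta` makes skip connections less
    trivially attractive to the gradient-based search; at beta=0 this
    reduces to a plain skip connection.
    """
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(loc=0.0, scale=beta, size=x.shape)
    return x + noise
```

With β = 0 the path is an exact identity, and raising β increases how strongly the search must rely on the learned operations rather than the skip path.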
“…Apart from our previous work NASC [16], which adopted our two-stage one-shot NAS approach FairNAS [17] for acoustic scene classification, DARTS [11] has also been applied to speaker recognition in AutoSpeech [18] and to speech recognition in DARTS-ASR [19]. There is a notable contemporary work [20] that also applies DARTS to KWS. However, due to its complex cell-based network topology, the searched networks may be of limited use for direct deployment on smart devices.…”
Section: Neural Architecture Search and Audio
confidence: 99%
“…Moving from fully-connected FFNN to CNN acoustic modeling was a natural step, taken back in 2015 [28]. By exploiting local time-frequency correlations in speech, CNNs are able to outperform fully-connected FFNNs for acoustic modeling in deep KWS with fewer parameters [28], [32], [72], [86], [96], [117], [122]-[125]. One attractive feature of CNNs is that the number of multiplications can easily be limited to meet computational constraints by adjusting hyperparameters such as filter striding and kernel and pooling sizes.…”
Section: B. Convolutional Neural Network
confidence: 99%
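The excerpt above notes that a CNN's multiplication count can be controlled via stride, kernel size, and channel counts. A back-of-the-envelope sketch of that accounting, under the assumption of 'same' padding before striding (the helper name `conv2d_mults` is hypothetical):

```python
def conv2d_mults(h, w, c_in, c_out, k, stride=1):
    """Approximate multiply count of one 2-D convolution layer.

    Each output element of each output channel costs k*k*c_in multiplies;
    the output spatial size shrinks by the stride (ceil division, i.e.
    'same' padding assumed). Doubling the stride roughly quarters the cost.
    """
    out_h = -(-h // stride)  # ceil(h / stride)
    out_w = -(-w // stride)  # ceil(w / stride)
    return out_h * out_w * c_out * k * k * c_in
```

For example, a 3x3 convolution over a 32x32 single-channel input with 64 output filters costs 4x fewer multiplies at stride 2 than at stride 1, which is exactly the kind of knob the excerpt describes for meeting a compute budget.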
“…Therefore, it is obvious that the non-streaming mode lacks some realism from a practical point of view. Despite this, isolated word classification is considered by a number of deep KWS works, e.g., [16], [30], [32], [48]- [52], [58], [69], [82], [89], [99], [109], [125], [128]- [130]. We believe that this is because of the simpler experimental framework with respect to that of the dynamic or streaming case.…”
Section: A. Non-streaming Mode
confidence: 99%