Small-Footprint Keyword Spotting on Raw Audio Data with Sinc-Convolutions

Mittermaier, Simon; Kürzinger, Ludwig; Waschneck, Bernd; Rigoll, Gerhard

doi:10.48550/arxiv.1911.02086

Cited by 5 publications

(19 citation statements)

References 0 publications


“…Since pre-processed data like MFCC features won't be always available, few CNN architectures have been developed to work on raw audio data as input. One of the notable ones is the SCN architecture proposed by Mittermaier et al [11], which uses SincNet [14] and DS convolutions [5] to achieve comparable accuracy to the state-of-the-art TC-ResNet models.…”

Section: Related Work (mentioning)

confidence: 99%

“…2 shows the respective architectures. Architectures of TC-ResNet8 and SCN adopted from [4] and [11] respectively.…”

Section: Model Architectures (mentioning)

confidence: 99%

“…SCN network (Fig. 2) [11] uses rectangular band-pass filters (in the frequency domains) in the first convolutional layer to classify on the input raw audio waveform. This is equivalent to convolving the input signal with parametrized sinc functions (sinc(x) = sin(x)…”

Section: SCN Architecture (mentioning)

confidence: 99%

“…Since only two parameters, the upper and lower cut-off frequencies, are required to define any sinc filter, this leads to a smaller memory footprint. As suggested in [11], a log-compression activation (y = log(abs(x) + 1)) is used after the sinc convolutions.…”

Section: SCN Architecture (mentioning)

confidence: 99%
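
The two SCN statements above describe a band-pass filter that is fully defined by its lower and upper cut-off frequencies, followed by the log-compression y = log(abs(x) + 1). A minimal NumPy sketch of that idea is given here. The function names, the 16 kHz sample rate, the kernel length of 101 and the Hamming window are illustrative assumptions, not details taken from [11].

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=101, sample_rate=16000):
    """Band-pass kernel built from two cut-off frequencies in Hz (hypothetical helper)."""
    # Sample times in seconds, centred on zero so the kernel is symmetric.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    t = n / sample_rate
    # A rectangular band-pass is the difference of two ideal low-pass filters.
    # np.sinc is the normalised sinc, sin(pi * x) / (pi * x).
    kernel = 2 * f_high * np.sinc(2 * f_high * t) - 2 * f_low * np.sinc(2 * f_low * t)
    kernel = kernel / sample_rate
    # Assumed: a Hamming window to soften the effect of truncating the sinc.
    return kernel * np.hamming(kernel_size)

def log_compress(x):
    """Log-compression activation quoted in the citation statement."""
    return np.log(np.abs(x) + 1.0)

# Usage: filter one second of a 1 kHz tone with a 500-1500 Hz band.
time = np.arange(16000) / 16000
wave = np.sin(2 * np.pi * 1000 * time)
kernel = sinc_bandpass(500.0, 1500.0)
features = log_compress(np.convolve(wave, kernel, mode="valid"))
```

Only `f_low` and `f_high` would be learned per filter in this sketch, which is the source of the smaller memory footprint mentioned in the statement; a conventional convolution would store all 101 kernel taps instead.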

“…The hyperparameters c, k and s represent the number of output channels, kernel size and stride respectively for all the models. Architectures of TC-ResNet8 and SCN adopted from [4] and [11] respectively.…”

mentioning

confidence: 99%


Behavior of Keyword Spotting Networks Under Noisy Conditions

Mohanty, Frischknecht, Gerum et al. 2021

Preprint


Keyword spotting (KWS) is becoming a ubiquitous need with the advancement in artificial intelligence and smart devices. Recent work in this field has focused on several different architectures to achieve good results on datasets with low to moderate noise. However, the performance of these models deteriorates under high noise conditions, as shown by our experiments. In our paper, we present an extensive comparison between state-of-the-art KWS networks under various noisy conditions. We also suggest adaptive batch normalization as a technique to improve the performance of the networks when the noise files are unknown during the training phase. The results of such high noise characterization enable future work in developing models that perform better in the aforementioned conditions.
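
The abstract names adaptive batch normalization but does not spell out the procedure. The sketch below assumes the common formulation in which the normalization statistics are re-estimated on unlabeled audio from the new noise condition while the learned scale and shift stay fixed. All names, shapes and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Inference-time batch normalization over (batch, channels) activations."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def adapt_bn_statistics(target_activations):
    """Re-estimate per-channel statistics from target-condition activations.

    Assumed procedure: no labels and no gradient steps are needed, only a
    forward pass over data recorded under the new noise condition.
    """
    return target_activations.mean(axis=0), target_activations.var(axis=0)

rng = np.random.default_rng(0)
channels = 4
gamma, beta = np.ones(channels), np.zeros(channels)

# Statistics stored during training on clean data (synthetic stand-in).
clean = rng.normal(0.0, 1.0, size=(512, channels))
train_mean, train_var = clean.mean(axis=0), clean.var(axis=0)

# Noisy test-time activations have a shifted, wider distribution.
noisy = rng.normal(2.0, 3.0, size=(512, channels))

# With the training statistics the output is no longer standardised.
mismatched = batch_norm(noisy, train_mean, train_var, gamma, beta)

# After adaptation the output is standardised again.
adapted_mean, adapted_var = adapt_bn_statistics(noisy)
adapted = batch_norm(noisy, adapted_mean, adapted_var, gamma, beta)
```

In this toy setup `mismatched` keeps a mean near 2 and an inflated variance, while `adapted` returns to roughly zero mean and unit variance per channel, which is the distribution the later layers were trained on.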



Use of Bidirectional Long Short Term Memory in Spoken Word Detection with reference to the Assamese language

Kalita, Borbora, Nath 2022

IJST


Objectives: The proposed method is based on a deep learning technique for identifying spoken words in the Assamese language. Most DNN-based algorithms have been successfully implemented in the fields of image recognition, computer vision, natural language processing and medical image analysis. Methods: The method used here is the Bidirectional Long Short Term Memory (BLSTM). BLSTM incorporates both past and future context. The speech database for this research work is obtained from the repository of the Indian Language Technology Proliferation and Development Center (ILTP-DC). This repository contains 32,335 utterances by 1000 male and female participants, comprising 262 unique native Assamese words. The BLSTM-based recognition model uses 10 of the 262 unique words, and the remaining words are used to construct synthesized sentences. The feature extraction module uses 39 feature coefficients, composed of MFCC, ∆MFCC and ∆∆MFCC coefficients. Findings: The Word Error Rate (WER) of the BLSTM-based recognition model is 18.84% with an average accuracy of 98.12%, which sets a promising benchmark when compared to recent findings. Novelty: This work attempts a different approach to detecting certain keywords of the Assamese language by adopting a deep learning methodology. The future objective of this work is to improve the detection capability of the model by combining multiple DNN models in a hybrid approach along with the inclusion of additional features.


Neural Architecture Search for Keyword Spotting

Mo, Yu, Salameh et al. 2020

Interspeech 2020


Keyword spotting aims to identify specific keyword audio utterances. In recent years, deep convolutional neural networks have been widely utilized in keyword spotting systems. However, their model architectures are mainly based on off-the-shelf backbones such as VGG-Net or ResNet, instead of being specially designed for the task. In this paper, we utilize neural architecture search to design convolutional neural network models that can boost the performance of keyword spotting while maintaining an acceptable memory footprint. Specifically, we search the model operators and their connections in a specific search space with Encoder-Decoder neural architecture optimization. Extensive evaluations on Google's Speech Commands Dataset show that the model architecture searched by our approach achieves a state-of-the-art accuracy of over 97%.

