2019
DOI: 10.1109/access.2019.2891838

Effective Combination of DenseNet and BiLSTM for Keyword Spotting

Abstract: Keyword spotting (KWS) is a major component of human-computer interaction for smart on-device terminals and service robots, the purpose of which is to maximize the detection accuracy while keeping footprint size small. In this paper, based on the powerful ability of DenseNet on extracting local feature-maps, we propose a new network architecture (DenseNet-BiLSTM) for KWS. In our DenseNet-BiLSTM, the DenseNet is primarily applied to obtain local features, while the BiLSTM is used to grab time series features. I…

Cited by 74 publications (63 citation statements)
References 17 publications
“…This method extracts features by using the Alex Net model and a trained conventional classifier, which is a support vector machine (SVM), to predict the emotions [38]. A CNN model extracts features from the whole utterance and feeds them to the LSTM or the RNNs to extract long term contextual dependencies in the speech signals [17]. Wen et al [39] presented a method for the SER using the DBN and the SVM where the high-level features are extracted by the DBN and then classified by the SVM.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…However, the FCNs model is not able to learn temporal features in this regard. The recurrent neural network (RNN) and the LSTM show good performances to model temporal dependency among the sequences [14, 17]. The RNN-LSTM network is suitable to learn long term contextual dependencies, and it is widely used in the SER domain [18].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Spectrogram is a suitable representation for CNNs model to extract high-level discriminative features from speech signals to recognize the emotional state of the speaker in the SER system [20]. Similarly, LSTM-RNNs are mostly used to learn hidden temporal information in speech signals which is cyclically employed in the SER system [21], [22]. Nowadays, deep learning approaches play a crucial role to increasing the research interest in SER.…”
Section: Literature Review of SER
Citation type: mentioning
confidence: 99%
“…Modern implementations of KWS algorithms either use sequence to sequence models such as Long Short-Term Memory (LSTM) based networks [8] or Convolutional Neural Network (CNN) based models [9] since the preprocessed input can be considered an image representing sound over time-frequency axes. Other variants include ResNets which are CNNs with skip connections [10].…”
Section: Related Work
Citation type: mentioning
confidence: 99%