Effective Combination of DenseNet and BiLSTM for Keyword Spotting

Zeng, Mengjun; Xiao, Ningchuan

doi:10.1109/access.2019.2891838

Cited by 74 publications

(63 citation statements)

References 17 publications


“…This method extracts features by using the AlexNet model and a trained conventional classifier, which is a support vector machine (SVM), to predict the emotions [38]. A CNN model extracts features from the whole utterance and feeds them to the LSTM or the RNNs to extract long term contextual dependencies in the speech signals [17]. Wen et al. [39] presented a method for the SER using the DBN and the SVM where the high-level features are extracted by the DBN and then classified by the SVM.…”

Section: Methods (mentioning)

confidence: 99%

“…However, the FCNs model is not able to learn temporal features in this regard. The recurrent neural network (RNN) and the LSTM show good performances to model temporal dependency among the sequences [14, 17]. The RNN-LSTM network is suitable to learn long term contextual dependencies, and it is widely used in the SER domain [18].…”

Section: Introduction (mentioning)

confidence: 99%


Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features

Tursunov

Mustaqeem

Kwon

2020

Sensors

127


Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, speech emotion recognition (SER) systems evaluate the emotional state of a speaker by analyzing his or her speech signal. Emotion recognition is a challenging task for a machine, and making it efficient is equally challenging. The speech signal is hard to examine using signal processing methods because it consists of different frequencies and features that vary according to emotions such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Even though different algorithms are being developed for SER, the success rates are very low, depending on the languages, the emotions, and the databases. In this paper, we propose a new lightweight and effective SER model that has low computational complexity and high recognition accuracy. The suggested method uses a convolutional neural network (CNN) to learn deep frequency features, using a plain rectangular filter with a modified pooling strategy that has more discriminative power for SER. The proposed CNN model was trained on frequency features extracted from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated on two benchmarks, the interactive emotional dyadic motion capture (IEMOCAP) and the Berlin emotional speech database (EMO-DB) datasets, and it obtained 77.01% and 92.02% recognition results, respectively. The experimental results demonstrated that the proposed CNN-based SER system can achieve better recognition performance than state-of-the-art SER systems.


“…Spectrogram is a suitable representation for CNN models to extract high-level discriminative features from speech signals to recognize the emotional state of the speaker in the SER system [20]. Similarly, LSTM-RNNs are mostly used to learn hidden temporal information in speech signals which is cyclically employed in the SER system [21], [22]. Nowadays, deep learning approaches play a crucial role in increasing the research interest in SER.…”

Section: Literature Review of SER (mentioning)

confidence: 99%

Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM

2020


Recognizing the emotional state of a speaker is a difficult task for machine learning algorithms and plays an important role in the field of speech emotion recognition (SER). SER is significant in many real-time applications, such as human behavior assessment, human-robot interaction, virtual reality, and emergency centers, where it is used to analyze the emotional state of speakers. Previous research in this field has mostly focused on handcrafted features and traditional convolutional neural network (CNN) models used to extract high-level features from speech spectrograms, increasing both the recognition accuracy and the overall cost complexity of the model. In contrast, we introduce a novel framework for SER using key sequence segment selection based on radial-based function network (RBFN) similarity measurement in clusters. The selected sequence is converted into a spectrogram by applying the STFT algorithm and passed into the CNN model to extract discriminative and salient features from the speech spectrogram. Furthermore, we normalize the CNN features to ensure precise recognition performance and feed them to a deep bi-directional long short-term memory (BiLSTM) network to learn the temporal information for recognizing the final state of emotion. In the proposed technique, we process the key segments instead of the whole utterance to reduce the computational complexity of the overall model, and we normalize the CNN features before their actual processing so that the model can easily recognize the spatio-temporal information. The proposed system is evaluated on standard datasets including IEMOCAP, EMO-DB, and RAVDESS, with the aim of improving the recognition accuracy and reducing the processing time of the model. The robustness and effectiveness of the suggested SER model are demonstrated experimentally in comparison with state-of-the-art SER methods, achieving up to 72.25%, 85.57%, and 77.02% accuracy on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively.
INDEX TERMS Speech emotion recognition, deep bidirectional long short-term memory, key segment sequence selection, normalization of CNN features, radial-based function network (RBFN).
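
The spectrogram and feature-normalization steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the frame length, the hop size, the Hann window, and the choice of z-score normalization are all assumptions made for illustration, and the segment selection, CNN, and BiLSTM stages are omitted.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=8, hop=4):
    """Magnitude spectrogram via a naive short-time Fourier transform.

    Returns one list per frame holding magnitudes for bins 0..frame_len // 2.
    frame_len and hop are illustrative values, not taken from the paper.
    """
    # Periodic Hann window (an assumed choice of window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def normalize_features(rows, eps=1e-8):
    """Z-score each feature column (assumed normalization scheme)."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = [math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / n)
            for d in range(dims)]
    return [[(r[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
            for r in rows]
```

In a full pipeline along the lines of the abstract, the spectrogram would feed a CNN and the normalized CNN features would feed the BiLSTM; here the two helpers are only meant to make the data flow concrete.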


“…Modern implementations of KWS algorithms either use sequence to sequence models such as Long Short-Term Memory (LSTM) based networks [8] or Convolutional Neural Network (CNN) based models [9] since the preprocessed input can be considered an image representing sound over time-frequency axes. Other variants include ResNets which are CNNs with skip connections [10].…”

Section: Related Work (mentioning)

confidence: 99%

Binary Speech Features for Keyword Spotting Tasks

Riviello

David

2019

Interspeech 2019


Keyword spotting is a classification task which aims to detect a specific set of spoken words. In general, this type of task runs on a power-constrained device such as a smartphone. One method to reduce the power consumption of a keyword spotting algorithm (typically a neural network) is to reduce the precision of the network weights and activations. In this paper, we propose a new representation of speech features which is more adapted to low-precision networks and compatible with binary/ternary neural networks. The new representation is based on the log-Mel spectrogram and models the variation of power over time. Tested on a ResNet, this representation produces results nearly as accurate as full-precision MFCCs, which are traditionally used in speech recognition applications.
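
As a rough illustration of a feature that "models the variation of power over time" and suits binary networks, the sketch below emits one bit per mel band per frame transition, set when the log energy rose. This is only one possible reading of the abstract; the function name `power_variation_bits` and the sign-of-difference rule are assumptions, not the representation defined in the paper.

```python
def power_variation_bits(log_mel):
    """Binarize frame-to-frame changes in a log-Mel spectrogram.

    log_mel: list of frames, each a list of per-band log energies.
    Returns len(log_mel) - 1 frames of 0/1 values, where 1 means the
    band's energy increased relative to the previous frame.
    (Hypothetical scheme for illustration only.)
    """
    bits = []
    for prev, curr in zip(log_mel, log_mel[1:]):
        bits.append([1 if c > p else 0 for p, c in zip(prev, curr)])
    return bits


# Three frames, two mel bands.
example = [[0.0, 1.0], [0.5, 0.5], [0.5, 2.0]]
features = power_variation_bits(example)
```

A ternary variant could add a third value for changes below a small threshold; the threshold would be another assumed parameter.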

