Speech signal processing is an active area of research: speech is the most dominant medium of information exchange among human beings and a natural modality for human–computer interaction (HCI). Assessing human behavior and recognizing emotion from a speech signal, known as speech emotion recognition (SER), is an emerging HCI area of exploration with various real-time applications. The performance of an efficient SER system depends on feature learning, which captures salient and discriminative information such as high-level deep features. In this paper, we propose a two-stream deep convolutional neural network with iterative neighborhood component analysis (INCA) to jointly learn spatial and spectral features and to select the most discriminative features for the final prediction. Our model is composed of two channels, each built on a convolutional neural network structure to extract cues from the speech signals. The first channel extracts features from the spectral domain and the second channel extracts features from the spatial domain; the two feature sets are then fused and fed to INCA, which removes redundancy and selects the optimal features for the final model training. The jointly refined features are passed through a fully connected network with a softmax classifier to yield predictions for the different emotions. We trained the proposed system on three benchmarks, the EMO-DB, SAVEE, and RAVDESS emotional speech corpora, and evaluated its prediction performance, achieving recognition rates of 95%, 82%, and 85%, respectively. These results demonstrate the effectiveness and significance of the proposed system.
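
To make the two-stream pipeline concrete, the sketch below shows a minimal layout of this kind in PyTorch. It is an illustrative assumption rather than the authors' exact configuration: the layer counts, kernel shapes, feature dimensions, and input representations are all hypothetical, and the INCA feature-selection step is only indicated by a comment, not implemented.

```python
# A minimal, hypothetical sketch of a two-stream CNN for SER with
# feature-level fusion. All sizes and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """One convolutional channel (e.g., the spectral or spatial stream)."""
    def __init__(self, in_channels: int = 1, feat_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> fixed-size vector
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x).flatten(1)
        return self.proj(h)

class TwoStreamSER(nn.Module):
    """Fuses the two streams, then classifies the emotion."""
    def __init__(self, num_classes: int = 7, feat_dim: int = 128):
        super().__init__()
        self.spectral = StreamCNN(feat_dim=feat_dim)
        self.spatial = StreamCNN(feat_dim=feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, spec_in: torch.Tensor, spat_in: torch.Tensor) -> torch.Tensor:
        # Concatenate the two feature vectors (feature-level fusion).
        fused = torch.cat([self.spectral(spec_in), self.spatial(spat_in)], dim=1)
        # In the paper, INCA would select an optimal subset of `fused`
        # before classification; this sketch feeds the fused vector directly.
        return self.classifier(fused)  # logits; softmax is applied in the loss

model = TwoStreamSER(num_classes=7)
spec = torch.randn(4, 1, 64, 64)  # e.g., spectrogram patches (assumed shape)
spat = torch.randn(4, 1, 64, 64)  # e.g., a spatial-domain representation
logits = model(spec, spat)
print(logits.shape)  # torch.Size([4, 7])
```

In a sketch like this, each stream ends in global pooling so the two channels can produce fixed-size vectors regardless of input length, which keeps the concatenation-based fusion straightforward; training would use a cross-entropy loss, which applies the softmax internally.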