2022
DOI: 10.3390/electronics11091328

Advanced Fusion-Based Speech Emotion Recognition System Using a Dual-Attention Mechanism with Conv-Caps and Bi-GRU Features

Abstract: Recognizing the speaker’s emotional state from speech signals plays a very crucial role in human–computer interaction (HCI). Nowadays, numerous linguistic resources are available, but most of them contain samples of a discrete length. In this article, we address the leading challenge in Speech Emotion Recognition (SER), which is how to extract the essential emotional features from utterances of a variable length. To obtain better emotional information from the speech signals and increase the diversity of the i…

Cited by 33 publications (11 citation statements)
References: 75 publications
“…Latif et al [21] also employed several DL models on raw speech. Several modified DL models are also observed for SER, including capsule neural network [22], 3D CNN-LSTM model [10], 3D CNN using k-means clustering [8], 2D CNN with a self-attention dilated residual network [9], Spiking Neural Network (SNN) [23], convolutional capsule (Conv-Cap) and bi-directional gated recurrent unit (Bi-GRU) [24], attention-LSTM-attention [25], Bi-GRU with attention mechanism [26], CNN with a capsule neural network (Caps Net) [13], temporal CNN with self-attention transfer network (SATN) [27], 1D CNN based on the multi-learning trick (MLT) [14], cascaded denoising CNN (Dn-CNN) [28], and a pre-trained deep CNN model with attention [29]. The learning features are the main attraction of different DL-based SER methods.…”
Section: SER With Deep Learning (DL) and Signal Transformation
Citation type: mentioning
confidence: 99%
“…MLT‐DNet [72], which is based on one‐dimensional dilated CNNs where the model uses a multi‐learning technique to extract spatial salient emotional features and long‐term contextual dependencies from speech signals. The other baselines include FaceNet [73] takes spectrogram and waveform as input, HGFM [74] is a hierarchical‐grained feature model, DualNet [22] composed of an attention‐based BLSTM, The graph attention‐based GRU (GA‐GRU) [75], XGBoost [76] Dual Att‐BLSTM [77], 3D‐CNN + ASRNN [78], and AMS‐Net [11].…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…Bubai et. al [53] presented Mel-spectrograms using the Conv-Cap module, and the remaining spectral characteristics from the input tensor were obtained using the Bi-GRU. Each module made use of the self-attention layer to identify the attention weight and preferentially concentrate on the best cues in order to produce high-level features.…”
Section: Machine Learning Techniques Used For Emotion Detection
Citation type: mentioning
confidence: 99%