2021
DOI: 10.3390/s21134399

Cascaded Convolutional Neural Network Architecture for Speech Emotion Recognition in Noisy Conditions

Abstract: Convolutional neural networks (CNNs) are a state-of-the-art technique for speech emotion recognition. However, CNNs have mostly been applied to noise-free emotional speech data, and limited evidence is available for their applicability in emotional speech denoising. In this study, a cascaded denoising CNN (DnCNN)–CNN architecture is proposed to classify emotions from Korean and German speech in noisy conditions. The proposed architecture consists of two stages. In the first stage, the DnCNN exploits the concep…

Citations: cited by 27 publications (11 citation statements)
References: 45 publications

“…Latif et al [21] also employed several DL models on raw speech. Several modified DL models are also observed for SER, including capsule neural network [22], 3D CNN-LSTM model [10], 3D CNN using k-means clustering [8], 2D CNN with a self-attention dilated residual network [9], Spiking Neural Network (SNN) [23], convolutional capsule (Conv-Cap) and bi-directional gated recurrent unit (Bi-GRU) [24], attention-LSTM-attention [25], Bi-GRU with attention mechanism [26], CNN with a capsule neural network (Caps Net) [13], temporal CNN with self-attention transfer network (SATN) [27], 1D CNN based on the multi-learning trick (MLT) [14], cascaded denoising CNN (Dn-CNN) [28], and a pre-trained deep CNN model with attention [29]. The learning features are the main attraction of different DL-based SER methods.…”
Section: SER With Deep Learning (DL) and Signal Transformation | Citation type: mentioning
confidence: 99%
“…With respect to the recognition of speech and birdsong, many researchers often learn multi-scale features directly from waveforms [42–44] or use short-time Fourier transforms [19, 45], and Mel filters [21, 46] to generate spectrograms as input into CNN. Mel filtering is designed to imitate human hearing habits, and there is a lack of evidence about whether birds have the same characteristics.…”
Section: Results | Citation type: mentioning
confidence: 99%
“…Ren et al [34] used the two-stage cascaded model to select the most promising candidate sets to achieve the image set classification task. Nam et al [35] removed noise signal in the first stage and realize target signal classification in the second stage.…”
Section: Methods | Citation type: mentioning
confidence: 99%