2021
DOI: 10.1016/j.apacoust.2021.108260
Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition

Cited by 90 publications (37 citation statements)
References 44 publications
“…Convolution is done by applying filters to the input image data, which decreases its size (Yamashita et al., 2018). An additional operation, the Rectified Linear Unit (ReLU) (Atila and Sengür, 2021), was used after every convolution operation to introduce a non-linear relationship between input and output. Finally, the pooling layer is used for secondary feature extraction: it retains the main features, reduces the number of parameters, saves computing resources, prevents over-fitting, and improves model generalization (Suarez-Paniagua and Segura-Bedmar, 2018).…”
Section: Creating the Feature Maps
confidence: 99%
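The statement above describes the standard convolution → ReLU → pooling pipeline. Below is a minimal sketch of one such block in PyTorch (the framework, layer sizes, and input shape are illustrative assumptions; the cited works do not specify them):

```python
# Minimal sketch of a conv -> ReLU -> max-pool block (sizes are assumptions).
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3),  # filters slide over the input: 64x64 -> 62x62
    nn.ReLU(),                                                 # non-linearity after every convolution
    nn.MaxPool2d(kernel_size=2),                               # pooling halves each spatial dim: 62x62 -> 31x31
)

x = torch.randn(1, 1, 64, 64)  # e.g. a single-channel spectrogram patch
print(block(x).shape)          # torch.Size([1, 16, 31, 31])
```

The shape trace makes the quote's point concrete: convolution and pooling each shrink the feature map while the channel count (the learned features) grows.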
“…Their proposal obtained an accuracy of 71.61% for RAVDESS. Other works, such as those proposed in [42][43][44], also employed CNNs, MLPs, or LSTMs to solve the emotion recognition task on RAVDESS, feeding these models with spectrograms or preprocessed features and obtaining accuracies of 80.00%, 96.18%, and 81%, respectively.…”
Section: Speech Emotion Recognition
confidence: 99%
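Feeding such models with spectrograms typically starts from a log-mel representation of the waveform. A minimal sketch with librosa (the library choice, sample rate, and file name are assumptions for illustration, not details from the cited works):

```python
# Minimal sketch: log-mel spectrogram input for a CNN/MLP/LSTM emotion model
# ('speech.wav' is a hypothetical file; parameters are illustrative).
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)                 # waveform at 16 kHz
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)  # mel-filterbank energies
log_mel = librosa.power_to_db(mel, ref=np.max)               # (64, n_frames) image-like input
print(log_mel.shape)
```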
“…A major concern of the investigation in this paper is that, despite the increasing number of publications that use RAVDESS, they lack a common evaluation framework, which makes it difficult to compare contributions. For example, Atila et al. [43] achieved 96.1% accuracy and, although they used a 10-fold cross-validation (10-CV) evaluation, they did not specify how users were distributed across the folds; it is therefore unclear whether the same user appears in both the training and test sets of a fold, which is crucial information for replicating their setup and comparing proposals. Another example of a different setup appears in Pepino et al. [40], where 20 users were used for training, two for validation, and two for the test set, classifying only seven of the eight emotions in the dataset.…”
Section: Speech Emotion Recognition
confidence: 99%
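The fold-composition concern can be made concrete: a speaker-independent split keeps all utterances of a given speaker in a single fold, whereas a plain random split may leak a speaker into both training and test sets. A minimal sketch with scikit-learn (the tooling and the toy arrays are assumptions; none of the cited works state their implementation):

```python
# Minimal sketch of speaker-independent 10-fold CV (arrays are hypothetical).
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(240, 40))           # toy features: 240 utterances x 40 dims
y = rng.integers(0, 8, size=240)         # 8 RAVDESS emotion labels
speakers = np.repeat(np.arange(24), 10)  # RAVDESS has 24 actors; 10 utterances each here

# Plain KFold may place the same speaker in train and test (speaker-dependent):
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# GroupKFold keeps each speaker's utterances in one fold (speaker-independent):
gkf = GroupKFold(n_splits=10)
for train_idx, test_idx in gkf.split(X, y, groups=speakers):
    assert not set(speakers[train_idx]) & set(speakers[test_idx])
```

Reporting which of these two regimes was used is exactly the detail the quoted statement finds missing, since speaker-dependent folds usually inflate accuracy.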
“…Their proposal reached an accuracy of 71.61% for RAVDESS. Other works, such as those proposed in [36][37][38], also employed CNNs, MLPs, or LSTMs to solve emotion recognition on RAVDESS using spectrograms or pre-processed features, obtaining accuracies of 80.00%, 96.18%, and 81%, respectively.…”
Section: Speech Emotion Recognition
confidence: 99%
“…Although RAVDESS appears in a growing number of publications, there is still no standard evaluation protocol, which makes it difficult to quantify and compare contributions. For example, in [37] the authors achieved 96.18% accuracy using a 10-CV evaluation. Nonetheless, they did not specify how users were distributed across the folds, making it unclear whether the same user participated in both the training and test sets.…”
Section: Speech Emotion Recognition
confidence: 99%