2020
DOI: 10.3390/s20185212
Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features

Abstract: Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, the speech emotion recognition (SER) system evaluates the emotional state of the speaker by investigating his/her speech signal. Emotion recognition is a challenging task for a machine. In addition, making it smarter so that the emotions are efficiently recognized by AI is equally challenging. The speech signal is quite hard to examine using signal processing methods because it consists of different frequencies …

Cited by 127 publications (55 citation statements)
References 58 publications
“…In contrary to that, there are few articles testing and analyzing the behavior of specific features and their settings in given conditions, e.g., testing frequency ranges or scales [ 44 ], etc. Nevertheless, published results measured even on the same database vary a lot, e.g., from approximately 50% [ 45 ] to even 92% [ 46 ], mainly due to the experimental set up, evaluation, processing, and classification. This study differs as it provides a unified, complex, and statistically rigorous analysis of great variety of basic speech properties, features and their settings, and calculation methods related to SER, by means of the machine learning.…”
Section: Discussion
confidence: 99%
“…The method presented in this study consists of acoustic features, deep features, pre-trained CNN and SVM combined model. In many studies, acoustic and deep features are used separately [11], [12], [16], [17]. In this study, acoustic and deep features are combined to improve the semantic information of the emotion features in the speech.…”
Section: Proposed Methods
confidence: 99%
“…Generally, the essential feature parameters utilized in the speech emotion recognition system can be separated into two categories in terms of conventional features and deep features. Features extracted from Convolutional Neural Network (CNN) layers are generally used as deep features [11], [12]. In [13], to recognize emotions from speech, a method that is based on MFCC features and Gaussian mixture model classifier is proposed.…”
Section: Related Work
confidence: 99%
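The quoted passage contrasts conventional acoustic features, such as the MFCCs fed to a Gaussian mixture model in [13], with deep features taken from CNN layers. As an illustration only (not the pipeline of the cited paper), a minimal MFCC extraction can be sketched in NumPy/SciPy — frame sizes, mel-band count, and the synthetic test tone below are assumptions chosen for the example:

```python
import numpy as np
from scipy.fft import dct

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """Illustrative MFCC sketch: frame -> window -> power spectrum
    -> mel filterbank -> log -> DCT. Parameters are example values."""
    # Frame the signal and apply a Hann window
    frames = np.array([signal[s:s + n_fft] * np.hanning(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank spanning 0 .. sr/2
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    # Log mel energies, then DCT to decorrelate -> cepstral coefficients
    mel_energy = np.log(spec @ fbank.T + 1e-10)
    return dct(mel_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]

# Example: one second of a 440 Hz tone at 16 kHz
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(sig)  # shape (num_frames, n_mfcc)
```

In a conventional pipeline such features would then be modeled by a classifier (e.g., a GMM or SVM); deep-feature approaches instead take activations from intermediate CNN layers as the representation.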
“…Recent SER models based on deep-learning architectures [ 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 ] have demonstrated state-of-the-art performance with an attention mechanism [ 19 , 20 , 22 , 23 , 25 , 26 ]. The deep-learning architectures adopted in previous studies included recurrent neural networks (RNN) [ 19 ], convolutional neural networks (CNN) [ 24 ], and convolutional RNNs (CRNN) [ 20 , 26 ]. Liu et al [ 21 ] presented an SER model of a decision tree for an extreme learning machine having a single hidden-layer feed-forward neural network, using a mixture of deep learning and typical classification techniques.…”
Section: Related Work
confidence: 99%