2019
DOI: 10.3390/s20010183

A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition

Abstract: Speech is the most significant mode of communication among human beings and a potential method for human-computer interaction (HCI) by using a microphone sensor. Quantifiable emotion recognition using these sensors from speech signals is an emerging area of research in HCI, which applies to multiple applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers to determine the speaker's emotional state from an individual's speech. In this paper, we …

Citations: cited by 252 publications (111 citation statements)
References: 37 publications
“…Mustaqeem and Kwon [33] revealed that the amount of energy transmitted by a sound wave is correlated with the amplitude of the sound wave. The amplitude of a sound wave denotes the maximum displacement of an element of the medium from its rest position.…”
Section: Related Work
confidence: 99%
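The amplitude-energy relation quoted above can be verified with a few lines of NumPy. The sketch below is purely illustrative (sample rate, tone frequency, and amplitudes are assumed values, not figures from the cited work): doubling the amplitude doubles the RMS value and quadruples the mean energy.

import numpy as np

# Illustrative check of the amplitude-energy relation (assumed parameters,
# not taken from the cited paper): a 440 Hz tone at three amplitudes.
sr = 16000                                  # sample rate in Hz
t = np.arange(sr) / sr                      # one second of samples
tone = np.sin(2 * np.pi * 440 * t)

for amplitude in (0.25, 0.5, 1.0):
    wave = amplitude * tone
    energy = np.mean(wave ** 2)             # mean power of the signal
    rms = np.sqrt(energy)                   # RMS grows linearly with amplitude
    print(f"amplitude={amplitude:.2f}  rms={rms:.3f}  energy={energy:.4f}")
# Doubling the amplitude doubles the RMS and quadruples the mean energy.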
“…This method has a drawback in that classifying emotions can be time-consuming, because the audio file must be analyzed and converted to audio without noise or silence during preprocessing. In the aforementioned studies [29, 30, 31, 32, 33], local correlations between spectral features could be ignored by using normalized spectral features from preprocessing.…”
Section: Related Work
confidence: 99%
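As a rough sketch of the silence removal and spectral-feature normalization described in this statement, the following Python snippet uses librosa; the file name and all parameter values are hypothetical, and this is one plausible reading of the preprocessing rather than the exact pipeline of the cited studies.

import numpy as np
import librosa

# "speech.wav" is a hypothetical input file; top_db and n_mels are
# illustrative choices, not parameters from the cited studies.
y, sr = librosa.load("speech.wav", sr=16000)

# Remove silent regions: keep intervals within 30 dB of the loudest frame
# and stitch the voiced segments back together.
intervals = librosa.effects.split(y, top_db=30)
voiced = np.concatenate([y[start:end] for start, end in intervals])

# Log-Mel spectral features, normalized per Mel band (zero mean, unit
# variance across frames); such global normalization can discard local
# correlations between neighbouring spectral frames.
mel = librosa.feature.melspectrogram(y=voiced, sr=sr, n_mels=40)
log_mel = librosa.power_to_db(mel, ref=np.max)
normalized = (log_mel - log_mel.mean(axis=1, keepdims=True)) / \
             (log_mel.std(axis=1, keepdims=True) + 1e-8)
print(normalized.shape)                     # (n_mels, n_frames)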
“…Local features are spectral low-level descriptors (LLDs) (Mel filter bank, MFCCs) and energy and voicing LLDs (loudness, F0, jitter, shimmer); global features involve functionals extracted from the LLDs, such as maximum, minimum, mean, standard deviation, duration, and regression coefficients [20]. The deep learning methods are deep neural networks (DNN), deep stacked auto-encoders (SAE), convolutional neural networks (CNN) [21], [22], long short-term memory networks (LSTM) [21], recurrent neural networks (RNN), and other similar methods [23].…”
Section: Acoustic Features in SER Literature
confidence: 99%
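The local-versus-global feature split described above can be sketched as follows, assuming librosa; the file name, the choice of 13 MFCCs, and the set of functionals are illustrative, not the configuration of [20] or the other cited works.

import numpy as np
import librosa

# "speech.wav", n_mfcc, and the chosen functionals are illustrative
# assumptions, not the exact configuration used in the cited works.
y, sr = librosa.load("speech.wav", sr=16000)

# Local (frame-level) LLDs: 13 MFCCs per analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # shape (13, n_frames)

# Global (utterance-level) features: statistical functionals applied to
# each LLD trajectory across all frames.
functionals = np.concatenate([
    mfcc.mean(axis=1),
    mfcc.std(axis=1),
    mfcc.min(axis=1),
    mfcc.max(axis=1),
])
print(mfcc.shape, functionals.shape)        # local matrix vs. fixed-length vector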
“…The multi-layer perceptron is also a supervised neural-network learning model used in SER [35]. Deep-learning classifiers widely used as classification techniques in the SER domain include RNN [23], CNN [21], [22], DNN [36], LSTM networks [8], autoencoders, multi-task learning, transfer learning, and attention mechanisms [6], [15].…”
Section: Classification in the SER Literature
confidence: 99%
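As one minimal example of the sequence models listed above, the following PyTorch sketch defines a toy LSTM classifier over frame-level features; all layer sizes, the 13 MFCC input features, and the four emotion classes are assumptions made for illustration, not the design of any cited system.

import torch
import torch.nn as nn

# Toy LSTM emotion classifier; sizes and class count are assumed for
# illustration, not taken from the cited papers.
class LSTMEmotionClassifier(nn.Module):
    def __init__(self, n_features=13, hidden=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                   # x: (batch, n_frames, n_features)
        _, (h_n, _) = self.lstm(x)          # final hidden state per utterance
        return self.head(h_n[-1])           # logits over emotion classes

model = LSTMEmotionClassifier()
dummy = torch.randn(8, 200, 13)             # 8 utterances, 200 MFCC frames each
print(model(dummy).shape)                   # torch.Size([8, 4])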
“…Nowadays, researchers mostly utilize deep learning techniques for SER, using the Mel-scale filter-bank speech spectrogram as an input feature. A spectrogram is a 2-D representation of a speech signal that is widely used in convolutional neural networks (CNNs) to extract salient and discriminative features in SER [2] and other signal processing applications [3], [4]. 2-D CNNs are mostly designed for visual recognition tasks [5]–[7], and researchers, inspired by their performance, have explored 2-D CNNs in the field of SER.…”
Section: Introduction of SER
confidence: 99%
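A minimal sketch of the spectrogram-as-image idea described above, assuming librosa and PyTorch: a log-Mel spectrogram is treated as a one-channel 2-D input to a small CNN. The file name, Mel resolution, and layer configuration are illustrative assumptions, not the architecture proposed in the cited paper.

import numpy as np
import librosa
import torch
import torch.nn as nn

# "speech.wav", n_mels, and the layer configuration are illustrative
# assumptions, not the cited architecture.
y, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)              # (64, n_frames)

x = torch.tensor(log_mel, dtype=torch.float32)[None, None]  # (1, 1, 64, T)

cnn = nn.Sequential(                        # spectrogram treated as an image
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                # global pooling over time-frequency
    nn.Flatten(),
    nn.Linear(32, 4),                       # 4 emotion classes (assumed)
)
print(cnn(x).shape)                         # torch.Size([1, 4])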