2014
DOI: 10.1109/tmm.2014.2360798

Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks

Abstract: As an essential way of human emotional behavior understanding, speech emotion recognition (SER) has attracted a great deal of attention in human-centered signal processing. Accuracy in SER heavily depends on finding good affect-related, discriminative features. In this paper, we propose to learn affect-salient features for SER using convolutional neural networks (CNN). The training of CNN involves two stages. In the first stage, unlabeled samples are used to learn local invariant features (LIF) using a variant…

Cited by 536 publications (227 citation statements)
References 42 publications
“…In the field of paralinguistics, several studies have been carried out using CNNs for feature learning, e.g., recently by Milde and Biemann [14], and Mao et al [15]. However, these works rely on a low-dimensional Mel filterbank feature vector and hence did not do a full end-to-end training of their system.…”
Section: Related Work (mentioning)
confidence: 99%
“…It is an attempt made to convey out a speaker independent system. Convolutional Neural Networks have been used by Qirong Mao et al [32] to learn affect salient features for Speech Emotion Recognition wherein there are two learning phases, simple features learnt in the lower layers and salient features learnt in higher layers and achievement is above 60% recognition rate amidst noise and channel distortion. Results show superior performance with respect to speaker, language variation and environment distortion.…”
Section: Speech Signals (mentioning)
confidence: 99%
“…Motivated by the success of deep learning techniques in various application domains, such as large scale image and speech recognition [4,5], several Deep Neural Network (DNN) or Convolutional Neural Network (CNN) based SER methods have recently been proposed [6,7,8,9,10,11,12]. In [6,7], a multistage procedure was applied, in which the DNN and CNN network were trained for frontend feature extraction, followed by a backend emotion recognizer such as SVM and Extreme Learning Machine (ELM).…”
Section: Introduction (mentioning)
confidence: 99%
“…In [6,7], a multistage procedure was applied, in which the DNN and CNN network were trained for frontend feature extraction, followed by a backend emotion recognizer such as SVM and Extreme Learning Machine (ELM). More recent works have taken advantage of end-to-end training schemes [9,11].…”
Section: Introduction (mentioning)
confidence: 99%