2019
DOI: 10.3390/s19122730

Speech Emotion Recognition with Heterogeneous Feature Unification of Deep Neural Network

Abstract: Automatic speech emotion recognition is a challenging task due to the gap between acoustic features and human emotions, and it relies strongly on the discriminative acoustic features extracted for a given recognition task. In this work, we propose a novel deep neural architecture to extract informative feature representations from heterogeneous acoustic feature groups, which may contain redundant and unrelated information that leads to low emotion recognition performance. After obtaining the informative featur…
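The abstract refers to heterogeneous acoustic feature groups as the input to the proposed architecture. As a rough illustration of what such groups might look like, the sketch below extracts a few common group types (cepstral, spectral, and prosodic descriptors) with librosa and keeps them separate per group. The grouping, the summary statistics, and the `extract_feature_groups` helper are assumptions for illustration, not the paper's exact front end.

```python
# Hedged sketch: assemble heterogeneous acoustic feature groups for one utterance.
# The grouping (cepstral / spectral / prosodic) and the frame-level statistics are
# illustrative assumptions, not the exact feature set used in the paper.
import numpy as np
import librosa

def extract_feature_groups(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)

    # Group 1: cepstral features (MFCCs), summarized by mean and std over frames.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    cepstral = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # Group 2: spectral shape features.
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    spectral = np.concatenate([contrast.mean(axis=1), centroid.mean(axis=1)])

    # Group 3: prosody-related features (energy and zero-crossing rate).
    rms = librosa.feature.rms(y=y)
    zcr = librosa.feature.zero_crossing_rate(y)
    prosodic = np.array([rms.mean(), rms.std(), zcr.mean(), zcr.std()])

    # Each group is kept separate so a later network can treat them heterogeneously.
    return {"cepstral": cepstral, "spectral": spectral, "prosodic": prosodic}
```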

Cited by 71 publications (45 citation statements) · References 45 publications
“…Pérez-Espinosa et al. [35] analyzed 6920 acoustic features from the interactive emotional dyadic motion capture (IEMOCAP) database and found that feature groups based on MFCC, LPC, and cochleagrams are important for estimating valence, activation, and dominance in speech, respectively. The authors in [36] proposed a DNN architecture for extracting informative feature representations from heterogeneous acoustic feature groups that may contain redundant and unrelated information. The architecture was tested by training the fusion network to jointly learn highly discriminative acoustic feature representations from the IEMOCAP database for speech emotion recognition, with an SVM classifier, obtaining an overall accuracy of 64.0%.…”
Section: Related Studies (mentioning)
confidence: 99%
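The statement above describes a pipeline in which per-group representations are fused by a network and the fused representation is classified with an SVM. The sketch below shows only that overall pattern; the `FusionNet` name, layer sizes, the random tensors standing in for IEMOCAP features, and the four-class label set are illustrative assumptions, and the fusion network's training loop is omitted.

```python
# Hedged sketch of a fusion network over heterogeneous feature groups followed by
# an SVM classifier. Names, dimensions, and data are illustrative assumptions.
import torch
import torch.nn as nn
from sklearn.svm import SVC

class FusionNet(nn.Module):
    def __init__(self, group_dims, hidden=64, fused_dim=128):
        super().__init__()
        # One small encoder per heterogeneous feature group.
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in group_dims]
        )
        # Joint layer that fuses the group-wise representations.
        self.fuse = nn.Sequential(nn.Linear(hidden * len(group_dims), fused_dim), nn.ReLU())

    def forward(self, groups):  # groups: list of (batch, d_i) tensors
        encoded = [enc(g) for enc, g in zip(self.encoders, groups)]
        return self.fuse(torch.cat(encoded, dim=1))

# Usage sketch: once the fusion network has been trained (training loop omitted),
# its fused representations are classified with an SVM, as in the cited statement.
net = FusionNet(group_dims=[26, 8, 4])
net.eval()
with torch.no_grad():
    fused = net([torch.randn(100, 26), torch.randn(100, 8), torch.randn(100, 4)])
svm = SVC(kernel="rbf")
svm.fit(fused.numpy(), torch.randint(0, 4, (100,)).numpy())  # 4 emotion classes assumed
```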
“…Emotion recognition remains a great challenge for several reasons, as previously alluded to in the introduction. Further reasons include the gap between acoustic features and human emotions [31,36] and the absence of a solid theoretical foundation relating the characteristics of the voice to the emotions of a speaker [20]. These intrinsic challenges have led to disagreement in the literature over which features are best for speech emotion recognition [20,36].…”
Section: Related Studies (mentioning)
confidence: 99%
“…The algorithm used for this segmentation is based only on acoustic information, which makes it easy to re-implement in real time. Suppose the utterance is segmented into its voiced segments using this algorithm; the output of the segmentation process is the set of waveforms of all voiced segments, which can be written as $V_i = \{v_{i,1}, v_{i,2}, \ldots, v_{i,N_i}\}$ (1), where $i$ is the utterance index, $V_i$ represents the sequence of all voiced segments for utterance $i$, $v_{i,j}$ is the $j$th voiced segment, and $N_i$ is the number of voiced segments in this utterance. For example, the result of segmenting an utterance into its voiced segments is shown in Figure 1.…”
Section: Proposed Emotion Unit (mentioning)
confidence: 99%
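The statement above does not spell out the voiced/unvoiced decision itself, so the sketch below substitutes a simple RMS-energy threshold purely for illustration; the threshold, frame sizes, and the `voiced_segments` helper are assumptions, not the algorithm of the cited paper. It returns the per-utterance list of voiced waveforms playing the role of $V_i = \{v_{i,1}, \ldots, v_{i,N_i}\}$ in Equation (1).

```python
# Hedged sketch: split an utterance into voiced segments using a simple RMS-energy
# threshold. The decision rule is an illustrative stand-in for the acoustic-only
# segmentation algorithm referenced in the cited paper.
import numpy as np
import librosa

def voiced_segments(y, sr, frame_length=400, hop_length=160, rel_threshold=0.1):
    # Frame-level energy; frames above a fraction of the peak energy are treated as voiced.
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    voiced = rms > rel_threshold * rms.max()

    segments = []  # this list plays the role of V_i in Equation (1)
    start = None
    for j, is_voiced in enumerate(voiced):
        if is_voiced and start is None:
            start = j                                   # a voiced run begins
        elif not is_voiced and start is not None:
            segments.append(y[start * hop_length:j * hop_length])  # close the run
            start = None
    if start is not None:                               # run extends to the end
        segments.append(y[start * hop_length:])
    return segments
```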
“…Automatic speech emotion recognition (ASER) systems are becoming a very important technology for human-computer interaction [1]. This technology can be embedded in many computer applications, as well as in robots, to make them sensitive to the user's emotional voice [2].…”
Section: Introduction (mentioning)
confidence: 99%