2019
DOI: 10.3390/computers8040091

An Investigation of a Feature-Level Fusion for Noisy Speech Emotion Recognition

Abstract: Because one of the key issues in improving the performance of Speech Emotion Recognition (SER) systems is the choice of an effective feature representation, most of the research has focused on developing a feature-level fusion using a large set of features. In our study, we propose a relatively low-dimensional feature set that combines three features: baseline Mel Frequency Cepstral Coefficients (MFCCs), MFCCs derived from Discrete Wavelet Transform (DWT) sub-band coefficients that are denoted as DMFCC, and pi…
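The abstract describes a feature-level fusion of baseline MFCCs, MFCCs computed from DWT sub-band coefficients (DMFCC), and a third feature that is truncated in the excerpt (assumed here, for illustration only, to be pitch-based). The snippet below is a minimal sketch of that kind of fusion using librosa and PyWavelets; the wavelet ('db4'), decomposition level, MFCC count, frame averaging, and pitch statistics are assumptions for the example, not the paper's actual settings.

```python
# Hedged sketch of a feature-level fusion in the spirit of the abstract:
# baseline MFCCs + MFCCs of DWT sub-band coefficients (DMFCC) + pitch stats,
# concatenated into a single feature vector. All parameters are assumptions.
import numpy as np
import librosa
import pywt


def fused_features(signal, sr, n_mfcc=13, wavelet="db4", level=3):
    # 1) Baseline MFCCs, averaged over frames.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

    # 2) DMFCC: MFCCs of each DWT sub-band coefficient sequence
    #    (approximation + details). Reusing the original sr here is a
    #    simplification made for this sketch.
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    dmfcc = np.concatenate([
        librosa.feature.mfcc(y=np.asarray(c, dtype=float),
                             sr=sr, n_mfcc=n_mfcc).mean(axis=1)
        for c in coeffs
    ])

    # 3) Pitch (F0) statistics from librosa's YIN estimator
    #    (50-400 Hz is a typical speech range).
    f0 = librosa.yin(signal, fmin=50, fmax=400, sr=sr)
    pitch_stats = np.array([f0.mean(), f0.std()])

    # Feature-level fusion = concatenation of the three feature groups.
    return np.concatenate([mfcc, dmfcc, pitch_stats])


# Example usage on a bundled librosa clip (a speech utterance would be used
# in practice):
y, sr = librosa.load(librosa.ex("trumpet"), duration=3.0)
print(fused_features(y, sr).shape)
```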

Cited by 19 publications (3 citation statements)
References 42 publications
“…This conclusion has been further confirmed in [35]. Additionally, the method relies strongly on prior knowledge of the noise [36], which limits its applications. Hsu et al. [37] explored adversarial training for disentangling the speaker attribute from the noise attribute.…”
Section: Introduction
confidence: 60%
“…Wang et al. [23] took advantage of the sequential floating forward search (SFFS) method to select the most efficient feature subset of the wavelet packet decomposition, based on an RBF-kernel support vector machine, for speaker-independent recognition. Sekkate et al. [24] employed a relatively low-dimensional feature set that combines three features, including baseline mel frequency cepstral coefficients (MFCCs) and MFCCs derived from discrete wavelet transform (DWT) sub-band coefficients (DMFCC), and applied feature extraction on this set to obtain better results under clean conditions and several noisy environments on specific emotional databases. Finally, we summarize the aforementioned works in Table 1, where SI and SD denote speaker-independent and speaker-dependent, respectively.…”
Section: Related Work
confidence: 99%
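The quotation above mentions selecting a feature subset with sequential floating forward search (SFFS) driven by an RBF-kernel SVM. The sketch below is an illustration of that general technique only, not the cited authors' code; the synthetic data, subset size, and SVM hyperparameters are placeholders, and mlxtend's SequentialFeatureSelector is assumed to be available.

```python
# Illustrative SFFS over an RBF-kernel SVM using mlxtend.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Placeholder data standing in for speech features and emotion labels.
X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           random_state=0)

svm = SVC(kernel="rbf", C=1.0, gamma="scale")

sfs = SFS(svm,
          k_features=10,      # target subset size (assumed)
          forward=True,
          floating=True,      # floating=True turns plain SFS into SFFS
          scoring="accuracy",
          cv=5)
sfs = sfs.fit(X, y)

print("Selected feature indices:", sfs.k_feature_idx_)
X_selected = sfs.transform(X)  # reduced feature matrix for downstream training
```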
“…The feature set includes different kinds of speech features, such as energy, pitch, voicing, formant and cepstral features. For example, pitch, also known as the fundamental frequency F0, is a feature that corresponds to the frequency of vibration of the vocal folds [24].…”
Section: Wavelet Packet Reconstruction Sequence Generation
confidence: 99%
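As a hedged illustration of the pitch feature described in the quotation above (not code from [24]), the following sketch extracts an F0 contour and a per-frame voicing decision with librosa's probabilistic YIN tracker; the example clip and frequency bounds are assumptions.

```python
# F0 contour and voicing flags via librosa's pyin (probabilistic YIN).
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))  # stand-in for a speech utterance

f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz; for speech, ~50-400 Hz is typical
    fmax=librosa.note_to_hz("C7"),
    sr=sr)

voiced_f0 = f0[voiced_flag]  # keep only frames judged voiced
if voiced_f0.size:
    print("median F0 over voiced frames (Hz):", np.median(voiced_f0))
else:
    print("no voiced frames detected")
```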