ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747517

Speech Emotion Recognition with Global-Aware Fusion on Multi-Scale Feature Representation

Abstract: Speech Emotion Recognition (SER) is a fundamental task to predict the emotion label from speech data. Recent works mostly focus on using convolutional neural networks (CNNs) to learn local attention maps on fixed-scale feature representations by viewing time-varied spectral features as images. However, rich emotional features at different scales and important global information cannot be well captured, due to the limits of existing CNNs for SER. In this paper, we propose a novel GLobal-Aware Multi-scale (…

Cited by 38 publications (10 citation statements)
References 16 publications
“…After acquiring the 3D log-mel spectrogram, inspired by [57], we present a Residual Multi-Scale Convolutional Neural Network (RMSCNN) structure, which is shown in Figure 2b. We create two distinct convolutional kernels, (1,3) and (3,1), to convolve the log-mel spectrogram before feeding it into the RMSCNN block.…”
Section: Residual Multi-Scale Convolutional Neural Network
confidence: 99%
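The two asymmetric kernels described above, (1,3) spanning time and (3,1) spanning frequency, can be sketched as follows. This is a minimal NumPy illustration of the idea, not the paper's implementation; the spectrogram size and the random kernel weights are assumptions for demonstration only.

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid'-mode 2D cross-correlation of a single-channel map x with kernel k."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Hypothetical log-mel spectrogram: 40 mel bins x 100 time frames
spec = np.random.randn(40, 100)

# Two asymmetric kernels: (1,3) convolves along time, (3,1) along frequency
k_time = np.random.randn(1, 3)
k_freq = np.random.randn(3, 1)

f_time = conv2d_valid(spec, k_time)  # shape (40, 98): time resolution shrinks
f_freq = conv2d_valid(spec, k_freq)  # shape (38, 100): frequency resolution shrinks
```

Because each kernel spans only one axis, the two branches capture temporal and spectral local patterns separately before the multi-scale block combines them.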
“…By employing several scales and receptive fields of various sizes, the collection of more accurate emotional information is encouraged. In contrast to [57], we add a skip connection, thereby creating a residual structure after the output of both convolutional layers; the same input and output channel numbers are maintained. The output feature maps are then concatenated along the channel dimension, and the BN and ReLU layers are added in the next step, as shown in the upper part of Figure 2c.…”
Section: Residual Multi-Scale Convolutional Neural Network
confidence: 99%
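The residual structure described above (shape-preserving branches with skip connections, channel-wise concatenation, then BN and ReLU) can be sketched in NumPy. The branch transform, tensor sizes, and the batch-norm stand-in are simplifying assumptions; a real implementation would use learned convolutions and per-channel normalization.

```python
import numpy as np

def branch(x, seed):
    """Stand-in for a shape-preserving conv branch (same channels, 'same' padding)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal() * x  # toy shape-preserving transform

def residual_multiscale_block(x):
    """Sketch of the cited block: two branches, each with a skip connection,
    concatenated along channels, then a BN-like normalization and ReLU."""
    y1 = branch(x, 0) + x  # skip connection, branch 1
    y2 = branch(x, 1) + x  # skip connection, branch 2
    y = np.concatenate([y1, y2], axis=0)    # channel-wise concatenation
    y = (y - y.mean()) / (y.std() + 1e-5)   # batch-norm stand-in
    return np.maximum(y, 0.0)               # ReLU

x = np.random.randn(8, 40, 100)  # (channels, mel bins, frames) - assumed sizes
out = residual_multiscale_block(x)
```

Note that concatenation doubles the channel count (8 to 16 here), which is why the branches themselves must keep input and output channels equal for the skip connections to be valid.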
“…Jiale Yan [17] introduced TT-MLP, which uses tensor decomposition to compress deep MLPs, thus reducing the number of training parameters. Wenjing Zhu [18] utilized gMLP components and the global attention module (GLAM) to achieve good results. However, speech emotional information is not uniformly distributed across speech features, and the aforementioned MLP structures did not consider how to focus the model on the required information.…”
Section: Introduction
confidence: 99%
“…It uses deep neural networks to generate probability distributions of emotional states and construct discourse-level features, which are then fed into an Extreme Learning Machine (ELM) to identify discourse-level emotions. Zhu and Li (2022) used a global-aware fusion module to capture the most important emotional information across various scales. On the other hand, transfer learning is extensively implemented in the SER field.…”
Section: Introduction
confidence: 99%
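The global-aware fusion idea cited above, selecting the most important emotional information across scales, can be illustrated with a toy attention-weighted fusion. This is only a sketch of the general pattern (score each scale globally, normalize the scores, fuse by weighted sum); the pooling choice and map sizes are assumptions, not the authors' exact module.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def global_aware_fusion(feature_maps):
    """Toy sketch: score each scale's feature map by its global average,
    softmax the scores into attention weights, and fuse by weighted sum."""
    pooled = np.array([f.mean() for f in feature_maps])  # global average pooling per scale
    weights = softmax(pooled)                            # one attention weight per scale
    return sum(w * f for w, f in zip(weights, feature_maps))

# Three same-sized feature maps from hypothetical scales
maps = [np.random.randn(40, 100) for _ in range(3)]
fused = global_aware_fusion(maps)  # shape (40, 100)
```

The global pooling step is what makes the fusion "global-aware": each scale is weighted by a statistic computed over the whole map rather than by local activations alone.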