ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9414635

Speech Emotion Recognition with Multiscale Area Attention and Data Augmentation

Abstract: In Speech Emotion Recognition (SER), emotional characteristics often appear in diverse forms of energy patterns in spectrograms. Typical attention neural network classifiers of SER are usually optimized on a fixed attention granularity. In this paper, we apply multiscale area attention in a deep convolutional neural network to attend emotional characteristics with varied granularities and therefore the classifier can benefit from an ensemble of attentions with different scales. To deal with data sparsity, we c…

Cited by 52 publications (24 citation statements)
References 20 publications
“…Besides investigating performance of SER models on clean test data, it is important to show that they also work well under more challenging conditions. Even though augmentation methods have been used to improve performance on clean test data [36,37], only a few studies have evaluated performance on augmented test data as well. Jaiswal and Provost [38] and Pappagari et al [39] have shown that previous SER models show robustness issues, particularly for background noise and reverb.…”
mentioning
confidence: 99%
“…To compare FrAUG with conventional data augmentation techniques, DepAudioNet models were trained using the DAIC-WOZ dataset with FrAUG, noise augmentation [25], VTLP-based augmentation [32,33], speed perturbation and pitch perturbation [16]. The noise augmentation method was similar to the one used in Kaldi.…”
Section: FrAUG Versus Conventional Data Augmentation Methods
mentioning
confidence: 99%
“…Furthermore, a global-aware fusion module is introduced to select important information reflecting emotional patterns in different scales. Experiments show both multi-scale feature representation and global-aware fusion can bring great improvements compared to the state-of-the-art CNN model [9].…”
Section: Introduction
mentioning
confidence: 99%
“…Recently, inspired by the uplifting progress in computer vision, existing studies [2,3,4,5,6,7,8,9] have achieved great improvements on SER by viewing spectral features as images. The standard convolutional neural network (CNN) architecture mainly consists of feature representation and attention map.…”
Section: Introduction
mentioning
confidence: 99%