Speech Emotion Recognition with Co-Attention Based Multi-Level Acoustic Information

Zou, Heqing; Si, Yuke; Chen, Chen; Rajan, Dinesh; Chng, Eng Siong

doi:10.1109/icassp43922.2022.9747095

Cited by 83 publications

(28 citation statements)

References 20 publications

“…
Method                       WA    UA
CTC+Attention [10]           67.0  69.0
Head Fusion [9]              76.2  76.4
HGFM [23]                    66.6  70.5
DAAE+CNN+Attention [24]      70.1  70.7
HNSD [25]                    70.5  72.5
CNN-ELM+STC attention [12]   61.3  60.4
Multi-level Co-att [11]      71.6  72.7
DKDFMH                       79.1  77.1
…”

Section: Methods WA UA (mentioning)

confidence: 99%

Hierarchical Network with Decoupled Knowledge Distillation for Speech Emotion Recognition

Zhao, Wang, Wang et al. 2023

Preprint

The goal of Speech Emotion Recognition (SER) is to enable computers to recognize the emotion category of a given utterance in the same way that humans do. The accuracy of SER is strongly dependent on the validity of the utterance-level representation obtained by the model. Nevertheless, the "dark knowledge" carried by non-target classes is always ignored by previous studies. In this paper, we propose a hierarchical network, called DKDFMH, which employs decoupled knowledge distillation in a deep convolutional neural network with a fused multi-head attention mechanism. Our approach applies logit distillation to obtain higher-level semantic features from different scales of attention sets and to delve into the knowledge carried by non-target classes, thus guiding the model to focus more on the differences between sentiment features. To validate the effectiveness of our model, we conducted experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. We achieved competitive performance, with 79.1% weighted accuracy (WA) and 77.1% unweighted accuracy (UA). To the best of our knowledge, this is the first time since 2015 that logit distillation has returned to state-of-the-art status.
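The WA and UA figures quoted in these citation statements can be illustrated with a short sketch. It assumes the convention commonly used in SER work on IEMOCAP: WA is the overall accuracy across all utterances, and UA is the mean of the per-class recalls. The function names and the toy labels below are hypothetical and are not taken from any of the cited papers, which may compute the metrics differently.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Hypothetical, deliberately imbalanced toy labels (not real IEMOCAP data).
y_true = ["neu", "neu", "neu", "neu", "ang", "ang", "sad", "hap"]
y_pred = ["neu", "neu", "neu", "ang", "ang", "neu", "sad", "neu"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

On imbalanced data the two numbers diverge: WA is dominated by the frequent classes, while UA penalizes a model that neglects rare ones, which is why the comparison tables report both.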

“…Head Fusion was proposed in [9] by fusing multi-attention heads in the same attention map. In the field of SER, [10,11,12] have shown that the attention mechanism performs well on several datasets, highlighting its effectiveness for sentiment classification.…”

Section: Related Work (mentioning)

confidence: 99%

Hierarchical Network with Decoupled Knowledge Distillation for Speech Emotion Recognition

Zhao, Wang, Wang et al. 2023

Preprint

“…
Method                                   WA     UA
Supervised Methods
CNN-ELM+STC attention [29]               61.32  60.43
Audio 25 [30]                            60.64  61.32
IS09-classification [31]                 68.10  63.80
Co-attention-based fusion [32]           69.80  71.05
Self-supervised Methods
Wav2Vec [33]                             59.79  -
Data2Vec Large [34]                      66.31  -
WavLM Large [35]                         70.62  -
HuBERT Large                             70.24  71.13
Data Augmentation Methods
GAN [10]                                 -      53.60
CycleGAN [12]                            -      60.37
VTLP [36]                                66.90  65.30
MWA-SER [37]                             -      66.00
HuBERT Large + CopyPaste [28]            70.79  71.35
HuBERT Large + Speed Perturbation [9]    70.35  …
…”

Section: Methods (mentioning)

confidence: 99%

Data Augmentation with Unsupervised Speaking Style Transfer for Speech Emotion Recognition

Qu, Wang, Li et al. 2022

Preprint

Currently, the performance of Speech Emotion Recognition (SER) systems is mainly constrained by the absence of large-scale labelled corpora. Data augmentation is regarded as a promising approach, which borrows methods from Automatic Speech Recognition (ASR), for instance, perturbation on speed and pitch, or generating emotional speech utilizing generative adversarial networks. In this paper, we propose EmoAug, a novel style transfer model to augment emotion expressions, in which a semantic encoder and a paralinguistic encoder represent verbal and non-verbal information respectively. Additionally, a decoder reconstructs speech signals by conditioning on the aforementioned two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches expressions of emotional speech in different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. In addition, we can also generate similar numbers of samples for each class to tackle the data imbalance issue. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining the speaker identity and semantic content. Furthermore, we train a SER model with data augmented by EmoAug and show that it not only surpasses the state-of-the-art supervised and self-supervised methods but also overcomes overfitting problems caused by data imbalance. Some audio samples can be found on our demo website.

“…Based on Transformer, various self-supervised speech representation learning approaches have also been proposed, including wav2vec [51], wav2vec 2.0 [52] and HuBERT [53]. Built on the pretrained self-supervised models, several researches have delivered promising results in the literature [33], [34], [39], [49], [54]-[57]. Typically, Monica et al. [33] fine-tuned the pretrained HuBERT model for AD detection and achieved competitive performance.…”

Section: B. Transformer in Paralinguistic Speech Processing (mentioning)

confidence: 99%

SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing

Chen, Xing, et al. 2023

Preprint

Paralinguistic speech processing is important in addressing many issues, such as sentiment and neurocognitive disorder analyses. Recently, Transformer has achieved remarkable success in the natural language processing field and has demonstrated its adaptation to speech. However, previous works on Transformer in the speech field have not incorporated the properties of speech, leaving the full potential of Transformer unexplored. In this paper, we consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing. More concretely, following the component relationship in the speech signal, we design a unit encoder to model the intra- and inter-unit information (i.e., frames, phones, and words) efficiently. According to the hierarchical relationship, we utilize merging blocks to generate features at different granularities, which is consistent with the structural pattern in the speech signal. Moreover, a word encoder is introduced to integrate word-grained features into each unit encoder, which effectively balances fine-grained and coarse-grained information. SpeechFormer++ is evaluated on the speech emotion recognition (IEMOCAP & MELD), depression classification (DAIC-WOZ) and Alzheimer's disease detection (Pitt) tasks. The results show that SpeechFormer++ outperforms the standard Transformer while greatly reducing the computational cost. Furthermore, it delivers superior results compared to the state-of-the-art approaches.
