2020
DOI: 10.48550/arxiv.2008.06682
Preprint

Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition

Abstract: Multimodal emotion recognition from speech is an important area in affective computing. Fusing multiple data modalities and learning representations with limited amounts of labeled data is a challenging task. In this paper, we explore the use of modality-specific "BERT-like" pretrained Self Supervised Learning (SSL) architectures to represent both speech and text modalities for the task of multimodal speech emotion recognition. By conducting experiments on three publicly available datasets (IEMOCAP, CMU-MOSEI, …
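As a rough illustration of the approach the abstract describes, the sketch below jointly fine-tunes a pretrained text encoder and a pretrained speech encoder and fuses their pooled utterance representations with a small classification head. The checkpoint names, mean pooling, and four-class output are illustrative assumptions, not the authors' exact implementation; the paper's speech model is a wav2vec-style SSL encoder, for which HuggingFace's wav2vec 2.0 is substituted here for convenience.

```python
# Minimal sketch (not the authors' code): "shallow" late fusion of two
# pretrained SSL encoders, jointly fine-tuned for emotion classification.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, Wav2Vec2Model


class ShallowFusionSER(nn.Module):
    def __init__(self, num_classes=4):  # 4 emotion classes is an assumption
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.speech_encoder.config.hidden_size)
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask, waveform):
        # Mean-pool each encoder's token/frame representations, then concatenate.
        text_h = self.text_encoder(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
        speech_h = self.speech_encoder(waveform).last_hidden_state
        fused = torch.cat([text_h.mean(dim=1), speech_h.mean(dim=1)], dim=-1)
        return self.classifier(fused)


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = ShallowFusionSER()
enc = tokenizer(["I can't believe this happened"], return_tensors="pt")
audio = torch.randn(1, 16000)  # dummy 1 s clip at 16 kHz standing in for real speech
logits = model(enc["input_ids"], enc["attention_mask"], audio)  # shape: (1, 4)
```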

Cited by 17 publications (38 citation statements)
References 33 publications
“…Eventually, we examine TEASEL against networks which fine-tune a Transformer for the downstream task. Our method has outperformed Self-MM [7], MAG-BERT, MAG-XLNet [33], and Shallow-fusion [34] methods in most metrics. Unlike MAG-Transformer methods, TEASEL does not require an aligned feature and explicitly feeds speech representations to the Transformer.…”
Section: Quantitative Analysis (mentioning)
confidence: 95%
“…[33] have proposed a method to fuse other modalities in the middle layer of a pre-trained Transformer-based language model in an aligned manner using a Multimodal Adaptation Gate (MAG) module. Later, with the popularity of Transformer-based models in Speech, [34] has examined jointly fine-tuning lexicon and speech Transformers on the multimodal language task. They implemented Co-Attention fusions and Shallow-Fusion using an attentive and a straightforward late fusion of two BERT-style [11] Transformers, respectively.…”
Section: Human Multimodal Language Analysis (mentioning)
confidence: 99%
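The co-attention fusion mentioned in this excerpt can be sketched as two cross-attention blocks, one per direction, applied to the token sequences produced by the text and speech encoders. The single attention layer, hidden size of 768, and mean pooling below are illustrative assumptions rather than the cited architecture.

```python
# Illustrative co-attention fusion: each modality attends over the other's
# token sequence before pooling and concatenation.
import torch
import torch.nn as nn


class CoAttentionFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.text_to_speech = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens, speech_frames):
        # Queries from one modality, keys/values from the other, and vice versa.
        t_att, _ = self.text_to_speech(text_tokens, speech_frames, speech_frames)
        s_att, _ = self.speech_to_text(speech_frames, text_tokens, text_tokens)
        # Pool the two cross-attended sequences and concatenate.
        return torch.cat([t_att.mean(dim=1), s_att.mean(dim=1)], dim=-1)


fusion = CoAttentionFusion()
text = torch.randn(2, 20, 768)     # batch of 2, 20 text tokens
speech = torch.randn(2, 150, 768)  # batch of 2, 150 speech frames
fused = fusion(text, speech)       # shape: (2, 1536)
```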
“…Recently, several models for automatic speech recognition (ASR) which use self-supervised pretraining have been released, including wav2vec [23] and VQ-wav2vec [24]. A few recent studies [25,26,27] have successfully applied representations from these models as features for emotion recognition.…”
Section: Introduction (mentioning)
confidence: 99%
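The studies cited in this excerpt use frozen self-supervised speech representations as input features for an emotion classifier rather than fine-tuning the encoder. A minimal sketch of that pattern, assuming a wav2vec 2.0 checkpoint in place of the original wav2vec/VQ-wav2vec models and a simple mean-pooled linear head:

```python
# Hedged sketch: frozen SSL speech features feeding a lightweight classifier.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
classifier = nn.Linear(encoder.config.hidden_size, 4)  # e.g. 4 emotion classes

waveform = torch.randn(1, 32000)  # dummy 2 s clip at 16 kHz
with torch.no_grad():             # encoder stays frozen; only the head is trained
    frames = encoder(waveform).last_hidden_state  # (1, T, 768)
features = frames.mean(dim=1)                     # utterance-level embedding
logits = classifier(features)                     # (1, 4)
```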
“…feature rich yet efficient representations (Zadeh et al. 2017; Liu et al. 2018; Hazarika, Zimmermann, and Poria 2020). Recently, (Rahman et al. 2020) used pre-trained Transformer-based (Tsai et al. 2019; Siriwardhana et al. 2020) models to achieve state-of-the-art results on the multimodal sentiment benchmarks MOSI (Wöllmer et al. 2013) and MOSEI (Zadeh et al. 2018c).…”
Section: Introduction (mentioning)
confidence: 99%