Interspeech 2020
DOI: 10.21437/interspeech.2020-1212
Jointly Fine-Tuning “BERT-Like” Self Supervised Models to Improve Multimodal Speech Emotion Recognition

Cited by 68 publications (52 citation statements)
References 0 publications
“…As a result, we achieve a state-of-the-art result on FSC and good accuracy on other datasets, SNIPS and Smartlights. We leave the extension of our method to other downstream tasks such as speech emotion recognition [27] and spoken question answering [28] as future work.…”
Section: Discussion
confidence: 99%
“…Siriwardhana et al. [63] investigated the use of pretrained “BERT-like” self-supervised learning (SSL) architectures to represent both the speech and text modalities for multimodal emotion recognition. They demonstrate that a basic fusion mechanism (Shallow-Fusion) simplifies the overall architecture and outperforms more complex fusion mechanisms.…”
Section: A Multimodal Emotion Recognition Combining (Audio…
confidence: 99%
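For intuition, here is a minimal sketch of what a Shallow-Fusion style classifier could look like, assuming pooled utterance-level embeddings from pretrained speech and text SSL encoders (e.g. a wav2vec-style model and BERT). The class name, layer sizes, and emotion count are illustrative assumptions, not the authors' exact architecture.

```python
# A minimal Shallow-Fusion sketch: concatenate the two modality embeddings
# and classify with a single linear head, with no cross-modal attention,
# gating, or other complex fusion machinery.
import torch
import torch.nn as nn

class ShallowFusionClassifier(nn.Module):  # hypothetical name
    def __init__(self, speech_dim=768, text_dim=768, num_emotions=4):
        super().__init__()
        self.head = nn.Linear(speech_dim + text_dim, num_emotions)

    def forward(self, speech_emb, text_emb):
        # "Shallow" fusion: plain concatenation of the pooled embeddings.
        fused = torch.cat([speech_emb, text_emb], dim=-1)
        return self.head(fused)

# Usage with dummy pooled embeddings for a batch of 2 utterances.
model = ShallowFusionClassifier()
logits = model(torch.randn(2, 768), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 4])
```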
“…In Tsai et al. (2019a), the authors use multiple cross-modal Transformers and apply late fusion to obtain trimodal features, resulting in a large number of parameters being needed to retain original-modality information. Other works that also use a cross-modal Transformer architecture include ; Siriwardhana et al. (2020). In contrast to the existing works, our proposed graph method, with a very small number of model parameters, can aggregate information from multiple (more than 2) modalities at an early stage by building edges between the corresponding modalities, allowing richer and more complex representations of the interactions to be learned.…”
Section: Related Work
confidence: 99%
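The graph method is only described at a high level in this excerpt; the following is a minimal sketch of the idea under stated assumptions: utterance-level embeddings for three modalities (audio, text, video) serve as graph nodes, a fully connected inter-modality adjacency supplies the edges, and a single mean-aggregation layer mixes information across modalities before classification. The module name, dimensions, and one-layer update are illustrative, not the cited model's actual architecture.

```python
# Sketch of graph-based early fusion: each modality embedding is a node,
# edges connect the modalities, and one round of degree-normalized
# message passing aggregates information across all of them.
import torch
import torch.nn as nn

class GraphFusion(nn.Module):  # hypothetical name
    def __init__(self, dim=128, num_emotions=4):
        super().__init__()
        self.msg = nn.Linear(dim, dim)            # transform neighbor messages
        self.head = nn.Linear(dim, num_emotions)  # classify pooled node states

    def forward(self, nodes, adj):
        # nodes: (batch, num_modalities, dim); adj: (num_modalities, num_modalities)
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        agg = (adj / deg) @ self.msg(nodes)   # mean over connected modalities
        nodes = torch.relu(nodes + agg)       # residual node update
        return self.head(nodes.mean(dim=1))   # pool nodes, then classify

# Fully connected graph over audio, text, and video nodes (no self-loops).
adj = torch.ones(3, 3) - torch.eye(3)
model = GraphFusion()
logits = model(torch.randn(2, 3, 128), adj)
print(logits.shape)  # torch.Size([2, 4])
```

Because the edges join modalities directly, a single aggregation step already mixes all three modalities, which is the "early stage" fusion the excerpt contrasts with late fusion of pairwise cross-modal Transformers.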