Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.417
Multimodal End-to-End Sparse Model for Emotion Recognition

Abstract: Existing works on multimodal affective computing tasks, such as emotion recognition, generally adopt a two-phase pipeline, first extracting feature representations for each single modality with hand-crafted algorithms and then performing end-to-end learning with the extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually finding feature extraction algorithms does not generalize or scale well to different tasks, which can lead to sub-optimal performance.
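To make the contrast in the abstract concrete, here is a minimal PyTorch sketch. The class names, layer choices, and dimensions are illustrative assumptions, not the paper's architecture; the point is only that gradients cannot reach a fixed feature extractor, while they do in the end-to-end setup.

```python
# Minimal sketch (assumed names/layers) of the two pipelines the abstract contrasts.
import torch
import torch.nn as nn

class TwoPhaseModel(nn.Module):
    """Phase 2 only: classify features extracted offline by hand-crafted algorithms."""
    def __init__(self, feat_dim: int, num_emotions: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_emotions)

    def forward(self, precomputed_feats: torch.Tensor) -> torch.Tensor:
        # The features arrive precomputed; gradients cannot reach the
        # extraction step, so it can never be fine-tuned for the task.
        return self.classifier(precomputed_feats)

class EndToEndModel(nn.Module):
    """Raw inputs in, emotion logits out: the extractor trains with the task."""
    def __init__(self, in_dim: int, feat_dim: int, num_emotions: int):
        super().__init__()
        self.extractor = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, num_emotions)

    def forward(self, raw_inputs: torch.Tensor) -> torch.Tensor:
        # One task loss now updates both the extractor and the classifier.
        return self.classifier(self.extractor(raw_inputs))
```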

Cited by 45 publications (57 citation statements) | References 31 publications
“…Recently, many studies have been performed on multimodal learning (Mroueh et al., 2015; Antol et al., 2015; Donahue et al., 2015; Zadeh et al., 2017; Dai et al., 2021). However, only a few have investigated MAS.…”
Section: Multimodal Abstractive Summarization (mentioning, confidence: 99%)
“…Most of the previous studies adopt a two-phase pipeline, first extracting unimodal features and then fusing them. Dai et al. (2021) argued that this may lead to sub-optimal performance, since the extracted unimodal features are fixed and cannot benefit from the downstream supervisory signals. Therefore, they proposed the multimodal end-to-end sparse model, which can optimize the unimodal feature extraction and multimodal feature fusion jointly.…”
Section: Related Work (mentioning, confidence: 99%)
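A minimal sketch of what this joint optimization could look like, assuming one trainable extractor per modality and a simple concatenation fusion. The module names, input dimensions, and fusion strategy are illustrative assumptions, not the paper's actual design; what matters is that one loss backpropagates through extraction and fusion together.

```python
import torch
import torch.nn as nn

class JointMultimodalModel(nn.Module):
    """Unimodal extractors and the fusion layer are trained by one task loss."""
    def __init__(self, dims: dict, hidden: int = 128, num_emotions: int = 7):
        super().__init__()
        # One trainable extractor per modality, replacing fixed hand-crafted features.
        self.extractors = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
            for name, dim in dims.items()
        })
        self.fusion = nn.Linear(hidden * len(dims), num_emotions)

    def forward(self, inputs: dict) -> torch.Tensor:
        # Iterate in ModuleDict order so concatenation is deterministic;
        # gradients flow through the fusion layer *and* every extractor.
        feats = [self.extractors[name](inputs[name]) for name in self.extractors]
        return self.fusion(torch.cat(feats, dim=-1))

# Hypothetical per-modality feature sizes, chosen only for the example.
model = JointMultimodalModel({"text": 300, "audio": 74, "vision": 35})
logits = model({"text": torch.randn(2, 300),
                "audio": torch.randn(2, 74),
                "vision": torch.randn(2, 35)})  # shape: (2, 7)
```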
“…For example, quite a few works focus on the fusion of modalities, such as the Tensor Fusion Network, Memory Fusion Network (Zadeh et al., 2018a), and Multimodal Adaptation Gate (Rahman et al., 2020). Additionally, the Multimodal Transformer (Tsai et al., 2019) was introduced to handle unaligned data, Dai et al. (2020a) proposed using emotional embeddings to enable zero-/few-shot learning for low-resource scenarios, and Dai et al. (2021) introduced sparse cross-attention to improve performance and reduce computation. Although remarkable progress has been made, we find that most models suffer from the small scale of data on these tasks.…”
Section: Related Work (mentioning, confidence: 99%)
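The statement above mentions sparse cross-attention. A generic top-k variant can illustrate the idea: attention scores outside the k largest per query are masked to negative infinity, so their softmax weight is zero and the corresponding computation can be skipped. This masking scheme is an assumption for illustration, not necessarily the sparsification used by Dai et al. (2021).

```python
import torch
import torch.nn.functional as F

def topk_sparse_cross_attention(q, k, v, keep: int):
    """Cross-attention keeping only the top-`keep` keys per query.

    Generic sketch: non-top-k scores are masked to -inf, giving them
    zero softmax weight. Not necessarily the paper's exact formulation.
    """
    # Scaled dot-product scores between queries and keys: (B, Lq, Lk).
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
    # Indices of the `keep` highest-scoring keys for each query.
    topk = scores.topk(keep, dim=-1).indices
    # Mask is -inf everywhere except the kept positions (set to 0).
    mask = torch.full_like(scores, float("-inf")).scatter(-1, topk, 0.0)
    return F.softmax(scores + mask, dim=-1) @ v

q = torch.randn(2, 8, 64)        # e.g. text queries
k = v = torch.randn(2, 20, 64)   # e.g. audio keys/values
out = topk_sparse_cross_attention(q, k, v, keep=4)  # shape: (2, 8, 64)
```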