2022
DOI: 10.48550/arxiv.2203.13504
Preprint

EmoCaps: Emotion Capsule based Model for Conversational Emotion Recognition

Abstract: Emotion recognition in conversation (ERC) aims to analyze the speaker's state and identify their emotion in the conversation. Recent works in ERC focus on context modeling but ignore the representation of contextual emotional tendency. To extract multimodal information and the emotional tendency of the utterance effectively, we propose a new structure named Emoformer, which extracts multimodal emotion vectors from different modalities and fuses them with the sentence vector into an emotion capsule. Furthermore…
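The fusion step the abstract describes can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch illustration of projecting each modality to an emotion vector and concatenating it with the sentence vector into an "emotion capsule"; the class name, projection layers, and dimensions are assumptions for illustration, not the paper's actual Emoformer.

```python
# Hypothetical sketch (not the authors' code): fuse per-modality emotion
# vectors with a sentence vector into a single "emotion capsule".
import torch
import torch.nn as nn

class EmoformerSketch(nn.Module):
    """Toy stand-in for the idea in the abstract: each modality
    (text / audio / visual) is encoded into an emotion vector, and the
    vectors are concatenated with the sentence vector."""

    def __init__(self, text_dim=768, audio_dim=128, visual_dim=64, emo_dim=32):
        super().__init__()
        # One small projection per modality; all dimensions are illustrative.
        self.text_proj = nn.Linear(text_dim, emo_dim)
        self.audio_proj = nn.Linear(audio_dim, emo_dim)
        self.visual_proj = nn.Linear(visual_dim, emo_dim)

    def forward(self, text_feat, audio_feat, visual_feat, sentence_vec):
        emo_t = torch.relu(self.text_proj(text_feat))
        emo_a = torch.relu(self.audio_proj(audio_feat))
        emo_v = torch.relu(self.visual_proj(visual_feat))
        # "Emotion capsule": modality emotion vectors fused with the sentence vector.
        return torch.cat([emo_t, emo_a, emo_v, sentence_vec], dim=-1)
```

In the full model, such capsules would presumably feed a conversation-level context model before emotion classification, which the truncated abstract does not show here.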

Cited by 6 publications (10 citation statements)
References 10 publications
“…Therefore, learning one after the other in a sequence may not give an accurate representation of these features. In addition, though all modalities consist of emotional cues pertinent to emotion recognition, the uttered sentences in the text modality consist of grammatical and semantic features [8] which can provide supplementary knowledge to the model for effective and robust SER. In [37], a model was suggested that uses multi-channel convolutional neural networks (MCNNs) to learn emotional and grammatical features from the text.…”
Section: Related Work (mentioning)
confidence: 99%
“…In [37], a model was suggested that uses multi-channel convolutional neural networks (MCNNs) to learn emotional and grammatical features from the text. However, as asserted in [8], CNNs and RNNs can weakly extract grammatical and semantic information since they are good at learning spatial and temporal features but not the context of the sequences. We therefore propose a model that concurrently learns spatial, temporal and semantic features in the LFLB and their representations are fused and fed into the GFLB.…”
Section: Related Work (mentioning)
confidence: 99%
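The statement above describes parallel branches for spatial, temporal, and semantic features whose outputs are fused before a global block. A rough, hypothetical sketch of that idea follows; the module choices, names (ParallelLFLB), and dimensions are assumptions, not the cited paper's implementation.

```python
# Hypothetical sketch: learn spatial, temporal and semantic representations in
# parallel branches and fuse them, in the spirit of the LFLB/GFLB description.
import torch
import torch.nn as nn

class ParallelLFLB(nn.Module):
    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        # Spatial branch: 1-D convolution over the time axis.
        self.spatial = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        # Temporal branch: a GRU over the sequence.
        self.temporal = nn.GRU(feat_dim, hidden, batch_first=True)
        # "Semantic" branch: self-attention over the sequence.
        self.semantic = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True)
        self.sem_proj = nn.Linear(feat_dim, hidden)

    def forward(self, x):                                     # x: (batch, time, feat_dim)
        spa = self.spatial(x.transpose(1, 2)).mean(dim=2)     # (batch, hidden)
        _, h = self.temporal(x)                               # h: (1, batch, hidden)
        tem = h.squeeze(0)                                    # (batch, hidden)
        sem = self.sem_proj(self.semantic(x).mean(dim=1))     # (batch, hidden)
        # Fuse the three representations; a GFLB-style global block would follow.
        return torch.cat([spa, tem, sem], dim=-1)
```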