2022
DOI: 10.1007/978-3-031-00129-1_1
Emotion-Aware Multimodal Pre-training for Image-Grounded Emotional Response Generation

Abstract: Face-to-face communication leads to better interactions between speakers than text-to-text conversation, since the speakers can capture both textual and visual signals. Image-grounded emotional response generation (IgERG) tasks require chatbots to generate a response with an understanding of both the textual context and the speaker's emotions conveyed in visual signals. Pre-training models enhance many NLP and CV tasks, and image-text pre-training also helps multimodal tasks. However, existing image-text pre-training methods…

Cited by 4 publications (2 citation statements)
References 55 publications (82 reference statements)
“…In their work, emotions were classified into two broad categories, namely, positive and negative, to facilitate a simplified emotional understanding. In a distinct study, Tian et al. [10] put forth a multitask learning framework in which tasks such as image sentiment sequential labeling, image sentiment classification, and text generation were learned simultaneously. This was accomplished using a pretrained model specifically designed to generate textual content that effectively captures the user's emotions.…”
Section: Emotional Dialogue System
confidence: 99%
“…In the realm of academia, researchers have extensively investigated dialogue models, such as those presented in Shuster et al. [4, 5], and have proposed emotion-enhanced models, as discussed in Wei et al. [6] and Li et al. [7]. Specifically, to address the limitations of text-only generation models, multimodal dialogue models capable of processing both textual and video information have been proposed, including the works of Fung et al. [8], Huber et al. [9], and Tian et al. [10]. More importantly, Shen et al. [11] designed ViDA-MAN, a digital human agent for multimodal interaction, which provides real-time audiovisual responses to users through voice queries.…”
Section: Introduction
confidence: 99%