Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
DOI: 10.18653/v1/2021.findings-acl.417

A Text-Centered Shared-Private Framework via Cross-Modal Prediction for Multimodal Sentiment Analysis

Abstract: Multimodal fusion is a core problem for multimodal sentiment analysis. Previous works usually treat the features of all three modalities equally and explore the interactions between modalities only implicitly. In this paper, we depart from this kind of method in two ways. Firstly, we observe that the textual modality plays the most important role in multimodal sentiment analysis, as can be seen from previous works. Secondly, we observe that, compared to the textual modality, the other two kinds of non-textual modalit…

Cited by 63 publications (8 citation statements) · References 16 publications
“…This strategy has the advantage of fusing features by calculating the transformations between two input features, but it ignores the relationships between class structures in the dataset. Based on the idea of weighted summation, many studies introduce attention mechanisms (Vielzeuf et al., 2018; Liu et al., 2019; Wu et al., 2021), self-attention mechanisms (Liu et al., 2021b), and gating mechanisms (Pu et al., 2020) into feature layers to reflect the different contributions of feature layers with different input resolutions to the final result. In view of the inconsistent quality, complex information, and obvious individual differences of the images, our own wFPN design performed significantly better than previous deep-learning models.…”
Section: Discussion (mentioning)
Confidence: 99%
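
A minimal sketch of the attention-weighted fusion idea this excerpt describes: a scalar weight is learned for each feature layer and the layers are combined by weighted summation. The class name, scoring head, and dimensions are illustrative assumptions, not the implementation of any cited paper.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Weighted summation of per-layer features with learned attention weights.

    Illustrative sketch only; not the architecture of any cited paper.
    """
    def __init__(self, dim: int):
        super().__init__()
        # One scoring head shared across layers: feature vector -> scalar logit.
        self.score = nn.Linear(dim, 1)

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of (batch, dim) vectors, one per feature layer
        # (e.g., globally pooled FPN levels of different input resolutions).
        logits = torch.stack([self.score(f) for f in feats], dim=1)  # (batch, L, 1)
        weights = torch.softmax(logits, dim=1)   # contribution of each layer
        stacked = torch.stack(feats, dim=1)      # (batch, L, dim)
        return (weights * stacked).sum(dim=1)    # (batch, dim)

# Toy usage: three 256-d feature layers for a batch of 4.
fusion = AttentionFusion(dim=256)
fused = fusion([torch.randn(4, 256) for _ in range(3)])
print(fused.shape)  # torch.Size([4, 256])
```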
“…Utterances with the same prediction results are labeled as MASK_v, while MASK_a is obtained using a similar method. We omit MASK_t as modality t already exhibits superior performance in ERC (Wu et al., 2021).…”
Section: Grum (mentioning)
Confidence: 99%
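
The excerpt is terse, so this sketch encodes one plausible reading only: an utterance enters MASK_v when its visual-only prediction agrees with its text-only prediction, i.e., the visual modality adds no new label information there. The function name and the agreement criterion are assumptions, not the cited paper's definition.

```python
import torch

def modality_mask(text_logits: torch.Tensor, visual_logits: torch.Tensor) -> torch.Tensor:
    """Return a boolean MASK_v over utterances.

    Assumption (not from the cited paper): an utterance is masked when the
    visual-only prediction matches the text-only prediction, so the visual
    modality contributes no additional label information for it.
    """
    text_pred = text_logits.argmax(dim=-1)    # (num_utterances,)
    visual_pred = visual_logits.argmax(dim=-1)
    return text_pred == visual_pred           # True where predictions agree

# Toy usage: 5 utterances, 3 emotion classes.
mask_v = modality_mask(torch.randn(5, 3), torch.randn(5, 3))
print(mask_v)
```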
“…In Han et al. (2021), the authors proposed a Transformer-based bi-bimodal fusion network, consisting of two text-related complementing modules, to separately fuse the textual feature sequence with the audio and visual feature sequences. In Wu et al. (2021), two cross-modal prediction modules, i.e., text-to-visual and text-to-audio models, were designed to decouple the shared and private information of the non-textual modalities relative to the textual modality. The shared non-textual information was used to enrich the semantics of the textual features, and the private non-textual features were later fused with the enhanced textual features through a regression layer for the final prediction.…”
Section: Related Work (mentioning)
Confidence: 99%
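
A hedged sketch of the text-centered shared-private pipeline as this excerpt describes it: text-to-visual and text-to-audio prediction modules split each non-textual modality into a shared part (used to enrich the textual representation) and a private residual (fused with the enriched text by a final regression layer). Treating the predictor output as "shared" and the residual as "private", along with all module names and dimensions, is an illustrative assumption, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TextCenteredSharedPrivate(nn.Module):
    """Sketch of a text-centered shared-private fusion pipeline.

    Assumption: the text-to-X predictor's output is the "shared" component
    and the residual the "private" one; the paper's decoupling may differ.
    """
    def __init__(self, d_text: int, d_visual: int, d_audio: int):
        super().__init__()
        self.text2visual = nn.Linear(d_text, d_visual)  # cross-modal prediction module
        self.text2audio = nn.Linear(d_text, d_audio)
        self.enrich = nn.Linear(d_text + d_visual + d_audio, d_text)
        self.regress = nn.Linear(d_text + d_visual + d_audio, 1)  # sentiment score

    def forward(self, t, v, a):
        # Shared = the part of each non-textual modality predictable from text.
        v_shared = self.text2visual(t)
        a_shared = self.text2audio(t)
        # Private = the residual the text cannot predict.
        v_private = v - v_shared
        a_private = a - a_shared
        # Enrich the text semantics with the shared non-textual information ...
        t_enh = torch.tanh(self.enrich(torch.cat([t, v_shared, a_shared], dim=-1)))
        # ... then fuse the private features with the enhanced text for regression.
        return self.regress(torch.cat([t_enh, v_private, a_private], dim=-1))

# Toy usage: batch of 4 utterance-level features with assumed dimensions.
model = TextCenteredSharedPrivate(d_text=768, d_visual=35, d_audio=74)
score = model(torch.randn(4, 768), torch.randn(4, 35), torch.randn(4, 74))
print(score.shape)  # torch.Size([4, 1])
```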
“…Several works proposed to first fuse the audio and visual feature sequences into a higher-level space, then fuse this bimodal feature sequence with the textual feature sequence (Fu et al., 2022; Zhang et al., 2022). Alternatively, text-centered frameworks were designed to explore the cross-modal interactions between textual and non-textual feature sequences (Han et al., 2021; He and Hu, 2021; Wu et al., 2021). In the works above, the textual features are feature sequences composed of word-level embeddings.…”
Section: Introduction (mentioning)
Confidence: 99%
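
A minimal sketch contrasting the two fusion orderings this excerpt mentions, using simple linear layers as stand-ins for the cited models; all dimensions and function names are assumptions for illustration.

```python
import torch
import torch.nn as nn

d_t, d_a, d_v, d = 768, 74, 35, 128  # assumed feature dimensions

# Ordering 1: fuse audio + visual into a bimodal space, then fuse with text.
av_fuse = nn.Linear(d_a + d_v, d)
avt_fuse = nn.Linear(d + d_t, d)

def bimodal_first(t, a, v):
    av = torch.relu(av_fuse(torch.cat([a, v], dim=-1)))  # higher-level bimodal features
    return avt_fuse(torch.cat([av, t], dim=-1))

# Ordering 2 (text-centered): text interacts with each non-textual stream directly.
ta_fuse = nn.Linear(d_t + d_a, d)
tv_fuse = nn.Linear(d_t + d_v, d)

def text_centered(t, a, v):
    return ta_fuse(torch.cat([t, a], dim=-1)) + tv_fuse(torch.cat([t, v], dim=-1))

# Toy usage on a batch of 4 utterance-level features.
t, a, v = torch.randn(4, d_t), torch.randn(4, d_a), torch.randn(4, d_v)
print(bimodal_first(t, a, v).shape, text_centered(t, a, v).shape)
```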