Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis

Han, Wei; Chen, Hui; Gelbukh, Alexander; Zadeh, Amir; Morency, Louis-Philippe; Poria, Soujanya

doi:10.1145/3462244.3479919

Cited by 127 publications

(41 citation statements)

References 37 publications

“…
• LMF: The model uses tensors to explore the interactions between modes and uses low-rank decomposition to alleviate the problem of number of parameters.
• MFM: To enhance the robustness of the model of capturing intra-and inter modality dynamics, MFM is a cycle style generative-discriminative model [14].
• MulT: Multimodal Transformer constructs an architecture unimodal and crossmodal transformer networks and complete fusion process by attention [15].…”

Section: Methods (mentioning)

confidence: 99%

“…The CMU-MOSEI dataset is an upgraded version of CMU-MOSI concerning the number of samples. It is also enriched in terms of the versatility of speakers and covers a broader scope of topics [14].…”

Section: Datasets (mentioning)

confidence: 99%

“…Deep learning based multimodal fusion algorithms can be classified according to the level of fusion: pixel level, feature level and decision level. Most of the advanced fusion methods are currently based on feature level, such as the MISA model for modality-invariant and -specific representations [13], the BBFN model for bi-bimodal modality fusion for correlation-controlled [14], and the MMIM model for improving multimodal fusion with hierarchical mutual information maximization [15]. The MISA-CT model proposed in this paper is based on the MISA model, which follows the basic process of MISA for multimodal data, mapping the original modal data to modality-invariant and -specific subspaces respectively.…”

Section: Introduction (mentioning)

confidence: 99%

“…It can be intuitively seen that the replication results of each paper for the MISA model are not exactly the same, and different hardware devices and software versions make the experimental results of the model not fully replicated, but the experimental results differ very little. In order to increase the credibility of the experiments, a set of experimental results of MISA in paper [14] and the best MISA-CT model in this paper are added to compare the experimental results, as shown in Table 2. Replication results of different papers for the MISA model.…”

mentioning

confidence: 99%

Modality-Invariant and -Specific Representations with Crossmodal Transformer for Multimodal Sentiment Analysis

Shan, Wei, Cai (2022). J. Phys.: Conf. Ser.

Human emotion judgments usually receive information from multiple modalities such as language, audio, as well as facial expressions and gestures. Because different modalities are represented differently, multimodal data exhibit redundancy and complementarity, so a reasonable multimodal fusion approach is essential to improve the accuracy of sentiment analysis. Inspired by the Crossmodal Transformer for multimodal data fusion in the MulT (Multimodal Transformer) model, this paper adds the Crossmodal transformer for modal enhancement of different modal data in the fusion part of the MISA (Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis) model, and proposes three MISA-CT models. Tested on two publicly available multimodal sentiment analysis datasets MOSI and MOSEI, the experimental results of the models outperformed the original MISA model.

“…Most of the high performance of existing models are dependent on a great number of learnable parameters [15, 16], ignoring the potential application in some promising areas like human–computer interaction (HCI), which requires real-time and light models. Thus, a lightweight model is necessary to improve the feasibility and practicability of the application of speech emotion recognition.…”

Section: Introduction (mentioning)

confidence: 99%

LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition

Liu, Shen, et al. (2022). Entropy

Semantic-rich speech emotion recognition has a high degree of popularity in a range of areas. Speech emotion recognition aims to recognize human emotional states from utterances containing both acoustic and linguistic information. Since both textual and audio patterns play essential roles in speech emotion recognition (SER) tasks, various works have proposed novel modality fusing methods to exploit text and audio signals effectively. However, most of the high performance of existing models is dependent on a great number of learnable parameters, and they can only work well on data with fixed length. Therefore, minimizing computational overhead and improving generalization to unseen data with various lengths while maintaining a certain level of recognition accuracy is an urgent application problem. In this paper, we propose LGCCT, a light gated and crossed complementation transformer for multimodal speech emotion recognition. First, our model is capable of fusing modality information efficiently. Specifically, the acoustic features are extracted by CNN-BiLSTM while the textual features are extracted by BiLSTM. The modality-fused representation is then generated by the cross-attention module. We apply the gate-control mechanism to achieve the balanced integration of the original modality representation and the modality-fused representation. Second, the degree of attention focus can be considered, as the uncertainty and the entropy of the same token should converge to the same value independent of the length. To improve the generalization of the model to various testing-sequence lengths, we adopt the length-scaled dot product to calculate the attention score, which can be interpreted from a theoretical view of entropy. The operation of the length-scaled dot product is cheap but effective. Experiments are conducted on the benchmark dataset CMU-MOSEI. 
Compared to the baseline models, our model achieves an 81.0% F1 score with only 0.432 M parameters, showing an improvement in the balance between performance and the number of parameters. Moreover, the ablation study signifies the effectiveness of our model and its scalability to various input-sequence lengths, wherein the relative improvement is almost 20% of the baseline without a length-scaled dot product.
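The length-scaled dot product mentioned in the abstract above can be illustrated with a small sketch. The abstract does not give the formula, so the scaling used here (multiplying the usual 1/sqrt(d) factor by log(n), where n is the number of keys) is an assumption chosen only to show the general idea of tying attention sharpness to sequence length; the function names are hypothetical and this is not the LGCCT implementation.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, length_scaled=True):
    # Standard scaled dot-product attention uses 1/sqrt(d).
    d = len(query)
    n = len(keys)
    scale = 1.0 / math.sqrt(d)
    if length_scaled:
        # ASSUMPTION: multiply by log(n); the paper's exact factor may differ.
        scale *= math.log(n)
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(logits)

def attend(query, keys, values, length_scaled=True):
    # Weighted sum of the value vectors.
    weights = attention_weights(query, keys, length_scaled)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def entropy(weights):
    # Shannon entropy (in nats) of an attention distribution.
    return -sum(w * math.log(w) for w in weights if w > 0.0)

# Toy inputs, invented for illustration only.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

weights = attention_weights(query, keys)
context = attend(query, keys, values)
```

Comparing `entropy(attention_weights(query, keys, True))` with the `length_scaled=False` variant on sequences of different lengths is one way to see how such a factor changes the spread of the attention distribution.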

Bi-attention Modal Separation Network for Multimodal Video Fusion

Gao (2022). MultiMedia Modeling
