Integrating Multimodal Information in Large Pretrained Transformers

Rahman, Wasifur; Hasan, Md. Kamrul; Lee, Sangwu; Zadeh, Amir; Mao, Chengfeng; Morency, Louis-Philippe; Hoque, Ehsan

doi:10.48550/arxiv.1908.05787

Cited by 6 publications

(3 citation statements)

References 0 publications

“…MAG-BERT. The multimodal adaptation gate for BERT uses a gate structure connected to the BERT model to continuously improve the multimodal recognition accuracy of the model by modifying the BERT model with attention and adaptive vectors conditional on non-verbal behavior [38]. The AFR-BERT model consists of the following four main components.…”

Section: PLoS ONE (mentioning)

confidence: 99%

AFR-BERT: Attention-based mechanism feature relevance fusion multimodal sentiment analysis model

Zhou

Wei

2022

PLoS ONE

Multimodal sentiment analysis is an essential task in natural language processing which refers to the fact that machines can analyze and recognize emotions through logical reasoning and mathematical operations after learning multimodal emotional features. For the problem of how to consider the effective fusion of multimodal data and the relevance of multimodal data in multimodal sentiment analysis, we propose an attention-based mechanism feature relevance fusion multimodal sentiment analysis model (AFR-BERT). In the data pre-processing stage, text features are extracted using the pre-trained language model BERT (Bi-directional Encoder Representation from Transformers), and the BiLSTM (Bi-directional Long Short-Term Memory) is used to obtain the internal information of the audio. In the data fusion phase, the multimodal data fusion network effectively fuses multimodal features through the interaction of text and audio information. During the data analysis phase, the multimodal data association network analyzes the data by exploring the correlation of fused information between text and audio. In the data output phase, the model outputs the results of multimodal sentiment analysis. We conducted extensive comparative experiments on the publicly available sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results show that AFR-BERT improves on the classical multimodal sentiment analysis model in terms of relevant performance metrics. In addition, ablation experiments and example analysis show that the multimodal data analysis network in AFR-BERT can effectively capture and analyze the sentiment features in text and audio.

“…The goal of multimodal sentiment analysis is to regress or classify the overall sentiment of an utterance using acoustic, visual, and language cues. Because multimodal sentiment analysis is a large and well-established field, we direct the reader to [2,21,29] for an overview of the field, and MISA [8], MAG [31], and M3ER [20] as representative of recent state of the art works. We restrict our scope to describing differences and similarities between our setting and the classical multimodal sentiment analysis setting.…”

Section: Related Work, 2.1 Multimodal Sentiment Classification (mentioning)

confidence: 99%

Exploiting BERT for Multimodal Target Sentiment Classification through Input Space Translation

Khan

2021

Proceedings of the 29th ACM International Conference on Multimedia

“…Besides, neural networks raise more attention in fusion especially since the appearance of RNN and LSTM [36,47]. More recently, transformer-based [51] fusion raises growing attention [1,48,37,16,21], especially after its application in vision [7]. In addition to that, there are also some model-agnostic fusion methods, including the simple concatenation [27,6,58] and element-wise operation [8,50].…”

Section: Related Work (mentioning)

confidence: 99%

Audio-Visual Transformer Based Crowd Counting

Sajid

Chen

Sajid

et al. 2021

Preprint

Crowd estimation is a very challenging problem. The most recent study tries to exploit auditory information to aid the visual models, however, the performance is limited due to the lack of an effective approach for feature extraction and integration. The paper proposes a new audiovisual multi-task network to address the critical challenges in crowd counting by effectively utilizing both visual and audio inputs for better modalities association and productive feature extraction. The proposed network introduces the notion of auxiliary and explicit image patch-importance ranking (PIR) and patch-wise crowd estimate (PCE) information to produce a third (run-time) modality. These modalities (audio, visual, run-time) undergo a transformer-inspired cross-modality co-attention mechanism to finally output the crowd estimate. To acquire rich visual features, we propose a multi-branch structure with transformer-style fusion in-between. Extensive experimental evaluations show that the proposed scheme outperforms the state-of-the-art networks under all evaluation settings with up to 33.8% improvement. We also analyze and compare the vision-only variant of our network and empirically demonstrate its superiority over previous approaches.
