Video sentiment analysis with bimodal information-augmented multi-head attention
2022 | DOI: 10.1016/j.knosys.2021.107676

Cited by 62 publications (11 citation statements) | References 53 publications

Citation statements (ordered by relevance):
“…In sentiment analysis, the most used modalities include text and transcribed audio [2]. Additionally, sentiment analysis can be conducted on images [69] and multi-modal data, such as videos combining image, audio and text [68], or social media posts that feature both images and text [31]. Sentiment analysis techniques can also be applied to process dynamic data for real-time information extraction [70].…”
Section: Impact Analysis Of Prompting Methods and Data Characteristic... (mentioning)
Confidence: 99%
“…Finally, building on our study focusing on the data modality of text for sentiment analysis, future research should investigate the performance of Generative AI in different data modalities. The emerging fields of image-based sentiment analysis [69] and multimodal analysis of videos [68], which combine image and audio, offer a wide range of possibilities for deeper and more differentiated sentiment extraction. Additionally, the application of sentiment analysis to real-time data streams such as social media posts [31] and live-streamed comments [13] presents opportunities for leveraging Generative AI in capturing immediate public sentiment.…”
Section: Note: the Table Records Average Accuracy Per Feature Without... (mentioning)
Confidence: 99%
“…We compare our model with the following state-of-the-art (SOTA) works in which the audio, visual and text modalities are all considered:
(1) Late Fusion LSTM (LF-LSTM): each modality uses an individual LSTM to extract global features, followed by an MLP for a unimodal decision; the final prediction is obtained by weighted fusion.
(2) Late Fusion Transformer (LF-TRANS): similar to LF-LSTM, except that Transformer models are used instead of LSTMs to model the temporal dependency of each modality.
(3) EmoEmbs (Dai et al., 2020): three LSTMs obtain the global features for each modality, and modality-specific emotion embeddings are generated by mapping the GloVe textual emotion embeddings to the non-textual modalities; the similarity scores between the emotion embeddings and the global features are then calculated and fused to produce the final prediction.
(4) MulT (Tsai et al., 2019): employs six cross-modal attention modules covering the pairs of the three modalities, followed by three self-attention modules that collect temporal information within each modality; the concatenated features are then passed through fully connected layers to make predictions.
(5) BIMHA (Wu et al., 2022): consists mainly of two parts, inter-modal interaction and inter-bimodal interaction, where the outer product is first used to represent the three pairs of bimodal global features and the bimodal attention is then calculated via an extended multi-head attention mechanism.
(6) CMHA (Zheng et al., 2022): its core is a series connection of multiple multi-head attention modules, modeling the interactions between two unimodal feature sequences first and then with the third one; the sequential order of modality fusion is also considered, resulting in three similar fusion modules with different orders of fusion.
(7) FE2E (Dai et al., 2021): a fully end-to-end framework in which textual features are extracted from a pre-trained ALBERT model and the audio and visual features are extracted from two pre-trained CNNs, each followed by a Transformer to encode the sequential representations; three MLPs then make unimodal decisions, and weighted fusion is performed to output predictions.
(8) MESM (Dai et al., 2021): similar to FE2E, except that the original CNN layers are replaced with cross-modal sparse CNN blocks to reduce the computational overhead.…”
Section: Methods (mentioning)
Confidence: 99%
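Item (5) above describes the BIMHA fusion idea at a high level: global features from two modalities are paired via an outer product, and the resulting bimodal representations interact through an extended multi-head attention. The snippet below is a minimal PyTorch sketch of that general idea only, not the authors' implementation; the class name BimodalOuterProductFusion, the projection layer, the feature dimensions, the mean pooling, and the single-output regression head are all assumptions made for illustration.

```python
# Illustrative sketch only (not the BIMHA authors' code): outer-product bimodal
# features followed by multi-head attention over the three modality pairs.
import torch
import torch.nn as nn


class BimodalOuterProductFusion(nn.Module):
    """Hypothetical module: outer-product pairing plus inter-bimodal attention."""

    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        # Project each flattened outer product (dim * dim) back to a common width.
        self.proj = nn.Linear(dim * dim, dim)
        # Multi-head attention over the three bimodal "tokens":
        # (text, audio), (text, visual), (audio, visual).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, 1)  # e.g. a sentiment-intensity regression head

    def forward(self, text, audio, visual):
        # text / audio / visual: (batch, dim) global features, one vector per modality.
        pairs = [(text, audio), (text, visual), (audio, visual)]
        bimodal = []
        for a, b in pairs:
            outer = torch.einsum("bi,bj->bij", a, b)     # (batch, dim, dim)
            bimodal.append(self.proj(outer.flatten(1)))  # (batch, dim)
        tokens = torch.stack(bimodal, dim=1)             # (batch, 3, dim)
        fused, _ = self.attn(tokens, tokens, tokens)     # inter-bimodal interaction
        return self.head(fused.mean(dim=1))              # (batch, 1)


if __name__ == "__main__":
    model = BimodalOuterProductFusion()
    t, a, v = torch.randn(2, 64), torch.randn(2, 64), torch.randn(2, 64)
    print(model(t, a, v).shape)  # torch.Size([2, 1])
```

In this sketch the outer product makes every pairwise feature interaction between two modalities explicit, and the multi-head attention over the three bimodal tokens stands in for the inter-bimodal interaction stage mentioned above.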
“…While text data are the most common object of study in SA applications [9], these applications can also be based on the processing of images, sound, video, or a combination of these data types [52]-[54]. Thus, the specific methodologies and content sources for the data collection step vary depending on the data types to be analyzed.…”
Section: A. Data Collection Techniques (mentioning)
Confidence: 99%