Video sentiment analysis with bimodal information-augmented multi-head attention

Wu, Tung Ying; Peng, Junjie; Zhang, Wenqiang; Zhang, Huiran; Tan, Shuhua; Yi, Fen; Ma, Chuanshuai; Huang, Yansong

doi:10.1016/j.knosys.2021.107676

Cited by 62 publications

(11 citation statements)

References 53 publications


“…In sentiment analysis, the most used modalities include text and transcribed audio [2]. Additionally, sentiment analysis can be conducted on images [69] and multi-modal data, such as videos combining image, audio and text [68], or social media posts that feature both images and text [31]. Sentiment analysis techniques can also be applied to process dynamic data for real-time information extraction [70].…”

Section: Impact Analysis Of Prompting Methods and Data Characteristic... (mentioning)

confidence: 99%

“…Finally, building on our study focusing on the data modality of text for sentiment analysis, future research should investigate the performance of Generative AI in different data modalities. The emerging fields of image-based sentiment analysis [69] and multimodal analysis of videos [68], which combine image and audio, offer a wide range of possibilities for deeper and more differentiated sentiment extraction. Additionally, the application of sentiment analysis to real-time data streams such as social media posts [31] and live-streamed comments [13] presents opportunities for leveraging Generative AI in capturing immediate public sentiment.…”

Section: Note: the Table Records Average Accuracy Per Feature Without... (mentioning)

confidence: 99%


Sentiment Analysis in the Age of Generative AI

Krugmann, Hartmann

2024

Cust. Need. and Solut.


In the rapidly advancing age of Generative AI, Large Language Models (LLMs) such as ChatGPT stand at the forefront of disrupting marketing practice and research. This paper presents a comprehensive exploration of LLMs’ proficiency in sentiment analysis, a core task in marketing research for understanding consumer emotions, opinions, and perceptions. We benchmark the performance of three state-of-the-art LLMs, i.e., GPT-3.5, GPT-4, and Llama 2, against established, high-performing transfer learning models. Despite their zero-shot nature, our research reveals that LLMs can not only compete with but in some cases also surpass traditional transfer learning methods in terms of sentiment classification accuracy. We investigate the influence of textual data characteristics and analytical procedures on classification accuracy, shedding light on how data origin, text complexity, and prompting techniques impact LLM performance. We find that linguistic features such as the presence of lengthy, content-laden words improve classification performance, while other features such as single-sentence reviews and less structured social media text documents reduce performance. Further, we explore the explainability of sentiment classifications generated by LLMs. The findings indicate that LLMs, especially Llama 2, offer remarkable classification explanations, highlighting their advanced human-like reasoning capabilities. Collectively, this paper enriches the current understanding of sentiment analysis, providing valuable insights and guidance for the selection of suitable methods by marketing researchers and practitioners in the age of Generative AI.



“…We compare our model with the following state of the art (SOTA) works where the audio, visual and text modalities are considered: (1) Late Fusion LSTM (LF-LSTM), where each modality uses an individual LSTM to extract global features followed by an MLP for unimodal decision, and the final prediction is obtained by weighted fusion; (2) Late Fusion Transformer (LF-TRANS) which is similar to LF-LSTM except that the Transformer models are used instead of LSTMs to model the temporal dependency for each modality; (3) EmoEmbs (Dai et al, 2020) where three LSTMs are adopted to obtain the global features for each modality and generates modality-specific emotion embeddings through mapping the GloVe textual emotion embeddings to the non-textual modalities respectively, and finally the similarity scores between the emotion embedding and the global features are calculated and fused to get the final prediction; (4) MulT (Tsai et al, 2019) that employs six cross-modal attention modules for any two pairs of the three modalities, and then three self-attention modules to collect temporal information within each modality. Finally the concatenated features are passed through the fully-connected layers to make predictions; (5) BIMHA (Wu et al, 2022) mainly consists of two parts: inter-modal interaction and inter-bimodal interaction, where the outer product is first used to represent three pairs of bimodal global features and then the bimodal attention is calculated via an extended multi-head attention mechanism; (6) CMHA (Zheng et al, 2022) where the core is connecting multiple multi-head attention modules in series, to model the interactions between two unimodal feature sequences first and then with the third one. Additionally, the sequential order of modality fusion is considered, resulting in three similar fusion modules but in different orders of fusion; (7) FE2E (Dai et al, 2021) which is a fully end-to-end framework, where the textual features are extracted from a pre-trained ALBERT model and the audio and visual features are extracted from two pre-trained CNNs, each followed by a Transformer to encode the sequential representations, and then three MLPs are adopted to make unimodal decision and weighted fusion is performed to output predictions; (8) MESM (Dai et al, 2021) which is similar to FE2E, except that the original CNN layers are replaced with cross-modal sparse CNN blocks to reduce the computational overhead.…”

Section: Methods (mentioning)

confidence: 99%
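
The statement above summarizes BIMHA as an outer product over three pairs of bimodal global features followed by an extended multi-head attention. A minimal sketch of that general idea is given here, not the authors' implementation: the feature values, the vector sizes, and the helper names (`outer_product`, `multi_head_attention`) are hypothetical, and the attention is a plain parameter-free scaled dot-product variant that stands in for the paper's extended mechanism, whose details are not given in the quoted text.

```python
import math
from itertools import combinations


def outer_product(u, v):
    """Flattened outer product: one entry per pairwise feature interaction."""
    return [a * b for a in u for b in v]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention(tokens, num_heads):
    """Scaled dot-product self-attention over equal-length vectors.

    Assumption: learned query/key/value projections are omitted, so each
    head simply attends over its own slice of the input vectors.
    """
    dim = len(tokens[0])
    assert dim % num_heads == 0, "vector size must divide evenly into heads"
    head_dim = dim // num_heads
    outputs = [[] for _ in tokens]
    for h in range(num_heads):
        # Slice of every token that belongs to head h.
        slices = [t[h * head_dim:(h + 1) * head_dim] for t in tokens]
        for i, query in enumerate(slices):
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(head_dim)
                for key in slices
            ]
            weights = softmax(scores)
            attended = [
                sum(w * s[d] for w, s in zip(weights, slices))
                for d in range(head_dim)
            ]
            outputs[i].extend(attended)
    return outputs


# Hypothetical global (utterance-level) features for each modality.
features = {
    "text": [0.5, -0.2],
    "audio": [0.1, 0.4],
    "visual": [-0.3, 0.2],
}

# Three bimodal pairs: text-audio, text-visual, audio-visual.
pairs = list(combinations(features, 2))
bimodal = [outer_product(features[a], features[b]) for a, b in pairs]

# Attention across the three bimodal representations.
fused = multi_head_attention(bimodal, num_heads=2)
```

Each entry of `fused` is a re-weighted mixture of the three bimodal vectors, so a pair whose interaction pattern resembles the others contributes more to the result. A trained model would add learned projections and a prediction head on top.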

Multimodal interaction enhanced representation learning for video emotion recognition

2022


Video emotion recognition aims to infer human emotional states from the audio, visual, and text modalities. Previous approaches are centered around designing sophisticated fusion mechanisms, but usually ignore the fact that text contains global semantic information, while speech and face video show more fine-grained temporal dynamics of emotion. From the perspective of cognitive sciences, the process of emotion expression, either through facial expression or speech, is implicitly regulated by high-level semantics. Inspired by this fact, we propose a multimodal interaction enhanced representation learning framework for emotion recognition from face video, where a semantic enhancement module is first designed to guide the audio/visual encoder using the semantic information from text, then the multimodal bottleneck Transformer is adopted to further reinforce the audio and visual representations by modeling the cross-modal dynamic interactions between the two feature sequences. Experimental results on two benchmark emotion databases indicate the superiority of our proposed method. With the semantic enhanced audio and visual features, it outperforms the state-of-the-art models which fuse the features or decisions from the audio, visual and text modalities.


“…While text data are the most common object of study in SA applications [9], these can also be based on the processing of images, sound, video, or a combination of these data types [52]-[54]. Thus, the specific methodologies and content sources for the data collection step vary depending on the types to be analyzed.…”

Section: A Data Collection Techniques (mentioning)

confidence: 99%