Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only the language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). Pre-trained models lack the components needed to accept the two extra modalities, visual and acoustic. In this paper, we propose an attachment to BERT and XLNet called the Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to the internal representation of BERT and XLNet, a shift that is conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.
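The idea of shifting an internal representation based on nonverbal input can be sketched in a few lines. The abstract states only that the shift is conditioned on the visual and acoustic modalities, so everything else here is an assumption made for illustration: the ReLU gates, the norm-based scaling factor, the function name `gated_shift`, and the parameter names `W_gv`, `W_ga`, `W_v`, `W_a` and `beta`. It is a simplified sketch for a single token vector, not the authors' implementation.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, x) for x in v]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def gated_shift(z, visual, acoustic, params, beta=1.0, eps=1e-6):
    """Return z plus a shift conditioned on the visual and acoustic vectors.

    z        : language representation of one token (length d)
    visual   : visual features aligned to that token (length d_v)
    acoustic : acoustic features aligned to that token (length d_a)
    params   : hypothetical weights; W_gv is d x (d + d_v), W_ga is
               d x (d + d_a), W_v is d x d_v, W_a is d x d_a
    """
    # Each gate sees the language vector concatenated with one nonverbal vector.
    g_v = relu(matvec(params["W_gv"], z + visual))
    g_a = relu(matvec(params["W_ga"], z + acoustic))
    # Project the nonverbal vectors into the language space and gate them.
    proj_v = matvec(params["W_v"], visual)
    proj_a = matvec(params["W_a"], acoustic)
    shift = [gv * pv + ga * pa
             for gv, pv, ga, pa in zip(g_v, proj_v, g_a, proj_a)]
    # Assumed safeguard: cap the shift relative to the size of z.
    alpha = min(beta * norm(z) / (norm(shift) + eps), 1.0)
    return [zi + alpha * si for zi, si in zip(z, shift)]


# Toy weights, invented purely for illustration.
example_params = {
    "W_gv": [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    "W_ga": [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    "W_v": [[0.5], [0.5]],
    "W_a": [[1.0], [1.0]],
}
print(gated_shift([1.0, 0.0], [1.0], [0.0], example_params))
```

In this sketch, all-zero nonverbal features leave the language vector unchanged, which is the behaviour one would want when fine-tuning starts from a language-only model.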
Emotion is a cognitive process and one of the important characteristics that distinguish human beings from machines. Traditionally, interactions between humans and machines such as computers do not involve any emotional exchange. If we could build a system intelligent enough to interact with humans in a way that involves emotions, that is, one that can detect user emotions and change its behaviour accordingly, then using machines could be more effective and friendly. Many approaches have been taken to detect user emotions. Affective computing is the field concerned with detecting user emotion at a particular moment. Our approach in this paper is to detect user emotions by analysing the keyboard typing patterns of users and the type of text (words, sentences) they type. This combined analysis gives a promising result, showing a substantial number of emotional states detected from user input. Several machine learning algorithms were used to analyse keystroke timing attributes and text patterns. We chose keystrokes because the keyboard is the cheapest and most widely available medium for interacting with computers. We considered seven emotional classes for classifying emotional states. For text pattern analysis, we used a vector space model with the Jaccard similarity method to classify free-text input. Our combined approach achieved accuracy above 80% in identifying emotions.
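Classifying free text by Jaccard similarity can be illustrated with a minimal sketch. It assumes a bag-of-words (set) representation and one reference word set per emotion class. The function names `jaccard` and `classify`, the two class labels and the word lists are hypothetical and are not taken from the paper, which uses seven classes and does not list its vocabulary in the abstract.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty inputs
    return len(a & b) / len(a | b)


def classify(text, exemplars):
    """Return the label whose reference word set is most similar to the text."""
    tokens = text.lower().split()
    return max(exemplars, key=lambda label: jaccard(tokens, exemplars[label]))


# Invented reference vocabularies, for illustration only.
exemplars = {
    "joy": {"happy", "glad", "great"},
    "anger": {"hate", "angry", "furious"},
}
print(classify("I am so happy and glad", exemplars))
```

A real system would need a tokenizer that handles punctuation and a rule for inputs that match no class; both are omitted here to keep the example short.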
Humor is a unique and creative communicative behavior often displayed during social interactions. It is produced in a multimodal manner, through the use of words (text), gestures (visual) and prosodic cues (acoustic). Understanding humor from these three modalities falls within the boundaries of multimodal language, a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, it has been understudied in a multimodal context. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding the multimodal language used in expressing humor. The dataset and accompanying studies present a framework for multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.