What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis

Gkoumas, Dimitris; Li, Qiuchi; Lioma, Christina; Yu, Yijun; Song, Dawei

doi:10.1016/j.inffus.2020.09.005

Cited by 68 publications

(20 citation statements)

References 67 publications

“…For this reason, we juxtapose the authors' results in order to acquire a sense of the actual differences between each architecture's ideas. For an empirical comparison of such architectures, which trains from scratch a wide range of the presented architectures, we refer to [88].…”

Section: Aggregated Reported Results (mentioning)

confidence: 99%

Deep Multimodal Emotion Recognition on Human Speech: A Review

Koromilas

Giannakopoulos

2021

Applied Sciences

This work reviews the state of the art in multimodal speech emotion recognition methodologies, focusing on audio, text and visual information. We provide a new, descriptive categorization of methods, based on the way they handle the inter-modality and intra-modality dynamics in the temporal dimension: (i) non-temporal architectures (NTA), which do not significantly model the temporal dimension in both unimodal and multimodal interaction; (ii) pseudo-temporal architectures (PTA), which also assume an oversimplification of the temporal dimension, although in one of the unimodal or multimodal interactions; and (iii) temporal architectures (TA), which try to capture both unimodal and cross-modal temporal dependencies. In addition, we review the basic feature representation methods for each modality, and we present aggregated evaluation results on the reported methodologies. Finally, we conclude this work with an in-depth analysis of the future challenges related to validation procedures, representation learning and method robustness.

“…For each model we reference both the original work and the one that evaluated the method on the MOSEI or IEMOCAP datasets. For detailed explanation and comparison of the aforementioned architectures, we refer the reader to detailed reviews on Multimodal Sentiment Analysis [21] and Multimodal Emotion Recognition [22].…”

Section: Results On Downstream Classification (mentioning)

confidence: 99%

“…In [21] the authors retrained 11 of the most powerful and widely used models for Multimodal Language Analysis and list the number of parameters for some of them. However, this study, due to different pretraining, reports smaller amount of parameters for some models (eg.…”

Section: Model Complexity (mentioning)

confidence: 99%

Unsupervised Multimodal Language Representations using Convolutional Autoencoders

Koromilas

Giannakopoulos

2021

Preprint

Multimodal Language Analysis is a demanding area of research, since it is associated with two requirements: combining different modalities and capturing temporal information. During the last years, several works have been proposed in the area, mostly centered around supervised learning in downstream tasks. In this paper we propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks. Towards this end, we map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets. Extensive experimentation on Sentiment Analysis (MOSEI) and Emotion Recognition (IEMOCAP) indicate that the learned representations can achieve near-state-of-the-art performance with just the use of a Logistic Regression algorithm for downstream classification. It is also shown that our method is extremely lightweight and can be easily generalized to other tasks and unseen data with small performance drop and almost the same number of parameters. The proposed multimodal representation models are open-sourced and will help grow the applicability of Multimodal Language.

“…23,24 Information may come from a single source, such as textual data or multiple sources, such as multimodal data. 25 They may be aggregated with simple means, such as the average, which is highly sensitive to outliers, or by more complex means, such as centroids, 26 interval-valued Pythagorean fuzzy numbers to include the correlation of a product's features, 21 or OWA operators, such as Induced Ordered Weighted Averaging (IOWA). 27 This paper presents a solution based on the distance defined in the lattice of hesitant linguistic terms.…”

Section: Introduction (mentioning)

confidence: 99%

“…23 Information fusion is not limited to ranking products, it can be applied to product recommendations, market analysis, product defect identification, 20 sentiment classification, 27 and video analysis. 25 Finally, multiple methods can be combined when an individual method may not be accurate enough. Specifically, this is used to improve the accuracy of lexicon-based methods for sentiment analysis by using cross-ratio uninorms.…”

Section: Introduction (mentioning)

confidence: 99%

Comparing global news sentiment using hesitant linguistic terms

Nguyen

Armisen

Agell

et al. 2021

Int J of Intelligent Sys

Global policy makers need to maintain a pulse on the state of play of global governance. Advances in analytical tools, such as global news dashboards, can provide current information on changes to global sentiment. In particular, identifying unexpected shifts in sentiment following a major news event may better inform stakeholders' actions. This paper defines a methodology to evaluate global sentiment for periods before, during, and after a major event. Each period's sentiment can be derived from news articles generated by news outlets. The sentiment is expressed in terms of hesitant linguistic terms to capture the range of sentiments articulated in each article. This representation is advantageous as it permits the interpretation of sentiments without conversion to numerical values. Sentiment from each article considered is aggregated into three centralized sentiments representing periods before, during, and after a particular event. This leads to a second enhancement to existing methods where the concept of a central opinion is represented in hesitant linguistic terms. Each of these sentiments is associated with a measure of consensus indicating the degree of agreement among the articles within their corresponding periods. A real case is presented for a noteworthy event in recent history. Three thousand three hundred fifty-two articles that referenced
