2021
DOI: 10.3390/s21248356
AttendAffectNet–Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-Attention

Abstract: In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and fusion techniques to combine the extracted features for predicting affect. Yet, these approaches typically ignore both the correlation between multiple modality inputs and the correlation among temporal inputs (i.e., sequential features). To explore these correlations, a neural network architecture—namely AttendAffectNet—…
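As a concrete illustration of the idea in the abstract, the sketch below fuses per-modality feature vectors with self-attention in PyTorch and regresses valence/arousal. All module names, layer sizes, and the pooling step are illustrative assumptions, not the published AttendAffectNet architecture.

```python
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Fuse per-modality feature vectors with self-attention and regress
    valence/arousal. Dimensions are illustrative, not the paper's config."""

    def __init__(self, modality_dims, d_model=128, num_heads=4):
        super().__init__()
        # One linear projection per modality into a shared embedding space.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in modality_dims])
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.head = nn.Linear(d_model, 2)  # valence and arousal

    def forward(self, features):
        # features: list of (batch, dim_i) tensors, one per modality.
        tokens = torch.stack([p(f) for p, f in zip(self.proj, features)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)   # attend across modalities
        return self.head(fused.mean(dim=1))            # pool and regress

# Toy usage with visual (512-d), audio (128-d) and movement (64-d) features.
model = SelfAttentionFusion(modality_dims=[512, 128, 64])
out = model([torch.randn(8, 512), torch.randn(8, 128), torch.randn(8, 64)])
print(out.shape)  # torch.Size([8, 2])
```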

Cited by 13 publications (10 citation statements). References 104 publications (202 reference statements).
“…The main goal of multimodal fusion is to reduce the heterogeneous differences among modalities [17], keep the integrity of the specific semantics of each modality, and achieve the best performance in deep learning models. It is divided into three types: joint architecture, cooperative architecture and codec (encoder-decoder) architecture.…”
Section: Fig. 4 Heterogeneous Integration of Multimedia Information In…
Confidence: 99%
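To make this taxonomy concrete, the sketch below shows joint-architecture fusion in PyTorch: each modality is encoded into a shared space and the encodings are concatenated into one joint representation. Names and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointFusion(nn.Module):
    """Joint-architecture fusion: encode each modality into a shared space
    and concatenate into one joint representation. Sizes are illustrative."""

    def __init__(self, modality_dims, hidden=128, num_outputs=2):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in modality_dims]
        )
        self.classifier = nn.Linear(hidden * len(modality_dims), num_outputs)

    def forward(self, features):
        # features: list of (batch, dim_i) tensors, one per modality.
        joint = torch.cat([enc(f) for enc, f in zip(self.encoders, features)], dim=-1)
        return self.classifier(joint)

# Audio (128-d) and visual (512-d) features fused into one prediction.
model = JointFusion(modality_dims=[128, 512])
print(model([torch.randn(4, 128), torch.randn(4, 512)]).shape)  # torch.Size([4, 2])
```

Under the same taxonomy, a cooperative (coordinated) architecture would instead keep separate per-modality embeddings tied by a similarity constraint, while a codec (encoder-decoder) architecture would translate one modality into another.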
“…Reviews have summarized extracted features relevant to affect detection in the audio modality, such as intensity (loudness, energy), timbre (MFCC) and rhythm (tempo, regularity) features [31], and in the video modality, such as colour, lighting key, motion intensity and shot length [35], [36]. Features that can capture complex latent dimensions in the data, such as the audio embeddings generated by the VGGish model [37], [38], [39], are also becoming increasingly popular. Features may be provided in the dataset or extracted from the source data if it is available.…”
Section: Datasets for Affective Multimedia Content Analysis
Confidence: 99%
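The hand-crafted audio features named above (intensity, timbre, rhythm) can be computed with standard toolkits. The sketch below uses librosa, which is an assumption here since the cited reviews do not prescribe a specific library; the file name is hypothetical.

```python
import librosa
import numpy as np

# Hypothetical file path; any mono audio clip works.
y, sr = librosa.load("movie_clip.wav", sr=22050, mono=True)

# Intensity: loudness proxy via RMS energy per frame.
rms = librosa.feature.rms(y=y)

# Timbre: 13 MFCCs, later averaged over time.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Rhythm: global tempo estimate from beat tracking.
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

# One fixed-length, clip-level feature vector.
clip_features = np.concatenate([
    np.atleast_1d(rms.mean()),
    mfcc.mean(axis=1),
    np.atleast_1d(tempo),
])
print(clip_features.shape)  # (15,)
```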
“…Audio feature extraction was performed with openSMILE [46], a popular open-source library for audio feature extraction. Specifically, we used the "emobase" configuration file to extract a set of 988 low-level descriptors (LLDs), including MFCC, pitch, spectral, zero-crossing rate, loudness and intensity statistics, many of which have been shown to be effective for identifying emotion in music [38], [39], [47], [48]. Many other configurations are available in openSMILE, but we provide the "emobase" set of acoustic features since it is well documented and was designed for emotion recognition applications [49].…”
Section: Feature Extraction
Confidence: 99%
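One way to reproduce this feature set is the opensmile Python wrapper, assuming that wrapper rather than the SMILExtract command-line tool the authors may have used; the file name is hypothetical.

```python
import opensmile

# The Python wrapper ships the same "emobase" configuration; the
# Functionals level yields the 988 clip-level statistics described above.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Returns a pandas DataFrame with one row per file
# and one column per acoustic feature.
features = smile.process_file("movie_clip.wav")
print(features.shape)  # (1, 988)
```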
“…Alternatively, in an effort to collect larger quantities of affect labels in a shorter amount of time, although with a potential loss in accuracy, crowd-sourcing on platforms such as Amazon Mechanical Turk (MTurk) has also been explored [3,53,54,55,56]. Some researchers utilize a mix of both online and offline collection methods [57,58], or even use predictive models such as AttendAffectNet [59] for the emotion labeling [60]. Regardless of the data collection method, it is important for each musical excerpt in the dataset to be labelled by multiple participants in order to account for subjectivity.…”
Section: Data Gathering Procedures
Confidence: 99%
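A small sketch of the aggregation step implied by the last sentence, using pandas with hypothetical column names: per-rater valence/arousal ratings are averaged per excerpt, and the standard deviation gives a rough handle on inter-rater subjectivity.

```python
import pandas as pd

# Hypothetical per-rating table: several participants rate each excerpt.
ratings = pd.DataFrame({
    "excerpt_id": [0, 0, 0, 1, 1, 1],
    "rater_id":   [1, 2, 3, 1, 2, 3],
    "valence":    [0.6, 0.4, 0.5, -0.2, -0.3, -0.1],
    "arousal":    [0.7, 0.8, 0.6,  0.1,  0.2,  0.0],
})

# Mean across raters is the usual gold label; the standard deviation
# is a simple measure of how much raters disagree on an excerpt.
labels = ratings.groupby("excerpt_id")[["valence", "arousal"]].agg(["mean", "std"])
print(labels)
```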