2019 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2019.8851942
Deep Fusion: An Attention Guided Factorized Bilinear Pooling for Audio-video Emotion Recognition

Abstract: Automatic emotion recognition (AER) is a challenging task due to the abstract nature of emotion and its many modes of expression. Although there is no consensus on a definition, human emotional states can usually be perceived through the auditory and visual systems. Inspired by this cognitive process in human beings, it is natural to use audio and visual information simultaneously in AER. However, most traditional fusion approaches only build a linear paradigm, such as feature concatenation and multi-system fusion, which…
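The fusion operation named in the title, factorized bilinear pooling, is a standard technique: each modality vector is projected into a shared space of k * d_out dimensions, the projections are combined with an element-wise product, and the result is sum-pooled over the factor dimension k. Below is a minimal NumPy sketch of that generic formulation; the function name, the dimensions, and the signed-square-root/L2 normalization step are illustrative assumptions, and the paper's attention-guided variant is not reproduced here.

    import numpy as np

    def factorized_bilinear_pooling(x_a, x_v, U, V, k):
        # Project each modality into a shared (k * d_out)-dim space,
        # then fuse with an element-wise (Hadamard) product.
        joint = (x_a @ U) * (x_v @ V)            # shape: (k * d_out,)
        # Sum-pool over the factor dimension k to get d_out outputs.
        z = joint.reshape(-1, k).sum(axis=1)     # shape: (d_out,)
        # Signed square root + L2 normalization, a common way to
        # stabilize bilinear features (an assumption, not from the paper).
        z = np.sign(z) * np.sqrt(np.abs(z))
        return z / (np.linalg.norm(z) + 1e-12)

    # Toy usage with random features and random projection matrices.
    rng = np.random.default_rng(0)
    d_a, d_v, k, d_out = 128, 256, 5, 64
    fused = factorized_bilinear_pooling(
        rng.standard_normal(d_a), rng.standard_normal(d_v),
        rng.standard_normal((d_a, k * d_out)),
        rng.standard_normal((d_v, k * d_out)), k)
    print(fused.shape)  # (64,)

The factorization keeps the expressiveness of a bilinear interaction between the two modalities while avoiding the full d_a * d_v * d_out parameter tensor.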

Cited by 28 publications (25 citation statements, all classified as mentioning). References 33 publications. Citing publications span 2019–2024.
“…The higher accuracy is achieved thanks to a redundancy gain that reduces the amount of uncertainty in the resulting information. Recent works show a growing interest toward multi-sensory fusion in several application areas, such as developmental robotics (Droniou et al, 2015 ; Zahra and Navarro-Alarcon, 2019 ), audio-visual signal processing (Shivappa et al, 2010 ; Rivet et al, 2014 ), spatial perception (Pitti et al, 2012 ), attention-driven selection (Braun et al, 2019 ) and tracking (Zhao and Zeng, 2019 ), memory encoding (Tan et al, 2019 ), emotion recognition (Zhang et al, 2019 ), multi-sensory classification (Cholet et al, 2019 ), HMI (Turk, 2014 ), remote sensing and earth observation (Debes et al, 2014 ), medical diagnosis (Hoeks et al, 2011 ), and understanding brain functionality (Horwitz and Poeppel, 2002 ).…”
Section: Introduction (mentioning; confidence: 99%)
“…
• First, there is an independent activity computation (Equation (13)): each neuron of the two SOMs computes its activity based on the afferent activity from the input.
• Second, there is a cooperation amongst neurons from different modalities (Equations (14) and (15)): each neuron updates its afferent activity via a multiplication with the lateral activity from the neurons of the other modality.
• Third and finally, there is a global competition amongst all neurons (line 19 in Algorithm 4): they all compete to elect a winner, that is, a global BMU with respect to the two SOMs.…”
Section: ReSOM Convergence for Classification (mentioning; confidence: 99%)
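The three steps quoted above describe a complete inference pass, so a compact sketch may make the data flow concrete. Everything below is reconstructed from the quoted description alone: the Gaussian-like activity function, the lateral weight matrices L_ab and L_ba, and all names are assumptions, not the cited authors' code.

    import numpy as np

    def resom_global_bmu(x_a, x_b, W_a, W_b, L_ab, L_ba, beta=1.0):
        # Step 1: independent afferent activity per map, here an
        # exponential of the prototype distance (assumed form).
        act_a = np.exp(-np.linalg.norm(W_a - x_a, axis=1) / beta)
        act_b = np.exp(-np.linalg.norm(W_b - x_b, axis=1) / beta)
        # Step 2: cross-modal cooperation -- each neuron multiplies its
        # afferent activity by the lateral activity received from the
        # other modality's map.
        coop_a = act_a * (L_ab @ act_b)
        coop_b = act_b * (L_ba @ act_a)
        # Step 3: global competition across both maps elects one BMU.
        merged = np.concatenate([coop_a, coop_b])
        idx = int(np.argmax(merged))
        return ('map_a', idx) if idx < coop_a.size else ('map_b', idx - coop_a.size)

    # Toy usage: two 16-neuron maps over 8- and 12-dim inputs.
    rng = np.random.default_rng(1)
    W_a, W_b = rng.random((16, 8)), rng.random((16, 12))
    L_ab, L_ba = rng.random((16, 16)), rng.random((16, 16))
    print(resom_global_bmu(rng.random(8), rng.random(12), W_a, W_b, L_ab, L_ba))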
“…Multimodal data fusion is thus a direct consequence of the well-accepted paradigm that certain natural processes and phenomena are expressed under completely different physical guises [6]. Recent works show a growing interest toward multimodal association in several applicative areas such as developmental robotics [3], audio-visual signal processing [7,8], spatial perception [9,10], attention-driven selection [11] and tracking [12], memory encoding [13], emotion recognition [14], human-machine interaction [15], remote sensing and earth observation [16], medical diagnosis [17], understanding brain functionality [18], and so forth. Interestingly, the last-mentioned application is our starting block: how does the brain handle multimodal learning in the natural environment?…”
Section: Introduction (mentioning; confidence: 99%)
“…Audio processing and spectrogram calculation. For each audio clip, the speech spectrogram and log Mel-spectrogram extraction processes are consistent with [32] and [3], respectively. For the speech spectrogram, we use the Hamming window with a 40 msec window size and a 10 msec shift.…”
Section: Video Preprocessing (mentioning; confidence: 99%)
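The window parameters quoted above are enough to reproduce the spectrogram front end. Here is a minimal sketch using scipy.signal; the 16 kHz sampling rate and the log-compression constant are assumptions not stated in the excerpt.

    import numpy as np
    from scipy import signal

    def speech_spectrogram(wave, fs=16000):
        # 40 ms Hamming window with a 10 ms shift, as quoted above;
        # the 16 kHz rate and the log compression are assumptions.
        nperseg = int(0.040 * fs)                 # 640 samples at 16 kHz
        hop = int(0.010 * fs)                     # 160 samples at 16 kHz
        freqs, times, sxx = signal.spectrogram(
            wave, fs=fs, window='hamming',
            nperseg=nperseg, noverlap=nperseg - hop)
        return freqs, times, np.log(sxx + 1e-10)  # log power spectrogram

    # Toy usage on one second of noise.
    freqs, times, logspec = speech_spectrogram(np.random.randn(16000))
    print(logspec.shape)  # (nperseg // 2 + 1, n_frames)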