2017
DOI: 10.1109/jstsp.2017.2764438
End-to-End Multimodal Emotion Recognition Using Deep Neural Networks

Abstract: Automatic affect recognition is a challenging task because emotions can be expressed through a variety of modalities. Applications can be found in many domains, including multimedia retrieval and human-computer interaction. In recent years, deep neural networks have been used with great success in determining emotional states. Inspired by this success, we propose an emotion recognition system using auditory and visual modalities. To capture the emotional content across various styles of speaking, robust features need t…

Citations: cited by 552 publications (315 citation statements)
References: 42 publications (48 reference statements)

Citation statements:
“…Perez-Gaspar et al. [53] extended the evolutionary computation of artificial neural networks and hidden Markov models to classify four emotions (angry, sad, happy, and neutral) by combining speech with facial expressions. Tzirakis et al. [54] also fused the auditory and visual modalities in an end-to-end valence-arousal emotion recognition model. They applied CNN and ResNet models to extract features from the speech and visual signals, respectively; the features were then concatenated and fed into an LSTM model so that the whole pipeline could be trained end to end.…”
Section: Multimodal Framework
Citation type: mentioning; confidence: 99%
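The statement above (and the similar one further down) describes one pipeline: a CNN over raw speech, a ResNet-50 over face frames, feature concatenation, and a two-layer LSTM regressing valence and arousal. Below is a minimal PyTorch sketch of that shape; the layer sizes, kernel sizes, and tanh output squashing are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultimodalEmotionNet(nn.Module):
    """Sketch of the described architecture: a 1-D CNN over raw speech,
    a ResNet-50 over face crops, per-step feature concatenation, and a
    2-layer LSTM that regresses valence and arousal."""

    def __init__(self, audio_feat=256, hidden=256):
        super().__init__()
        # Audio branch: 1-D convolutions over each raw waveform chunk.
        # Filter counts and strides are assumptions for illustration.
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=8, stride=2), nn.ReLU(),
            nn.MaxPool1d(10),
            nn.Conv1d(64, audio_feat, kernel_size=6, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # one feature vector per chunk
        )
        # Visual branch: ResNet-50 backbone with its classifier removed.
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()          # exposes 2048-d features
        self.visual_cnn = backbone
        # Temporal fusion: concatenated features -> 2-layer LSTM -> head.
        self.lstm = nn.LSTM(audio_feat + 2048, hidden,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2)     # valence and arousal

    def forward(self, audio, frames):
        # audio:  (batch, time, samples) raw speech chunks
        # frames: (batch, time, 3, 224, 224) aligned face crops
        b, t = audio.shape[:2]
        a = self.audio_cnn(audio.reshape(b * t, 1, -1)).squeeze(-1)
        v = self.visual_cnn(frames.reshape(b * t, *frames.shape[2:]))
        fused = torch.cat([a, v], dim=-1).reshape(b, t, -1)
        out, _ = self.lstm(fused)
        return torch.tanh(self.head(out))    # per-step values in [-1, 1]
```

Trained end to end, gradients from the valence-arousal loss flow through the LSTM back into both convolutional branches, which is what distinguishes this design from classifiers built on fixed, hand-crafted features.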
“…We compare the performance of the proposed approach with the raw-waveform-based END2YOU network [23] on the three sub-challenges. In terms of UAR, the proposed raw-CNN approach performs comparably to the baseline on the Crying task, while outperforming it on the other two tasks (see Table 4).…”
Section: Results on Raw-Waveform CNN
Citation type: mentioning; confidence: 99%
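The comparison above is made in UAR (unweighted average recall), which averages per-class recall with equal weight so that rare classes count as much as common ones. A small sketch of the standard definition, with purely illustrative toy arrays:

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    """Recall computed per class, then averaged with equal weight."""
    recalls = []
    for c in np.unique(y_true):
        mask = (y_true == c)                        # samples of class c
        recalls.append((y_pred[mask] == c).mean())  # recall for class c
    return float(np.mean(recalls))

# Majority-class guessing: 80% accuracy but only 50% UAR, which is
# why UAR is preferred on imbalanced paralinguistic tasks.
y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0])
print(unweighted_average_recall(y_true, y_pred))  # 0.5
```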
“…We also compare the proposed approach with the corresponding end-to-end baseline. As confidence scores are not available for the baseline system [23], it cannot be included in the fusion techniques.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
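The statement above turns on the fact that score-level fusion needs per-class confidence scores, which the end-to-end baseline does not expose. A generic late-fusion sketch makes that dependency concrete; the function and weighting scheme here are hypothetical, not taken from the cited paper:

```python
import numpy as np

def late_fusion(score_matrices, weights=None):
    """Average per-class confidence scores from several classifiers
    (optionally weighted) and pick the argmax class per sample."""
    stacked = np.stack(score_matrices)        # (models, samples, classes)
    w = (np.ones(len(score_matrices)) if weights is None
         else np.asarray(weights, dtype=float))
    w = w / w.sum()                           # normalize model weights
    fused = np.tensordot(w, stacked, axes=1)  # (samples, classes)
    return fused.argmax(axis=1)

# Two hypothetical systems scoring three samples over two classes:
a = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
b = np.array([[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]])
print(late_fusion([a, b]))  # [0 0 1]
```

A system that emits only hard labels, or no scores at all, cannot participate in this kind of combination, which is the limitation the authors point out.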
“…A different approach is presented in [12], which uses a CNN to extract features from the speech, while the visual information is represented with a deep residual network (ResNet-50). The features from the aforementioned networks are concatenated and fed into a two-layer Long Short-Term Memory (LSTM) module.…”
Section: B. Emotion Recognition Using Visual and Audio Information
Citation type: mentioning; confidence: 99%