Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML) 2018
DOI: 10.18653/v1/w18-3304

Convolutional Attention Networks for Multimodal Emotion Recognition from Speech and Text Data

Abstract: Emotion recognition has become a popular topic of interest, especially in the field of human-computer interaction. Previous works involve unimodal analysis of emotion, while recent efforts focus on multimodal emotion recognition from vision and speech. In this paper, we propose a new method of learning about the hidden representations between just speech and text data using convolutional attention networks. Compared to the shallow model which employs simple concatenation of feature vectors, the proposed attent…
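The abstract contrasts attention-based fusion with a simple concatenation of speech and text feature vectors. As a rough, hypothetical sketch only (not the paper's actual architecture), the following PyTorch snippet shows one way a convolutional attention fusion of speech frames and word embeddings could be wired up; the module name, layer sizes, and the particular attention formulation are all assumptions.

```python
# Hypothetical sketch: convolutional attention fusion of speech and text features.
# Layer sizes, names, and the attention formulation are illustrative assumptions,
# not the architecture described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeechTextAttentionFusion(nn.Module):
    def __init__(self, speech_dim=40, text_dim=300, hidden=128, n_classes=6):
        super().__init__()
        # 1-D convolutions over time for each modality (e.g. MFCC frames, word embeddings).
        self.speech_conv = nn.Conv1d(speech_dim, hidden, kernel_size=3, padding=1)
        self.text_conv = nn.Conv1d(text_dim, hidden, kernel_size=3, padding=1)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, speech, text):
        # speech: (batch, T_s, speech_dim), text: (batch, T_t, text_dim)
        s = F.relu(self.speech_conv(speech.transpose(1, 2)))  # (batch, hidden, T_s)
        t = F.relu(self.text_conv(text.transpose(1, 2)))      # (batch, hidden, T_t)
        # Cross-modal attention: each text step attends over the speech steps.
        scores = torch.bmm(t.transpose(1, 2), s) / s.size(1) ** 0.5  # (batch, T_t, T_s)
        attn = scores.softmax(dim=-1)
        attended_speech = torch.bmm(attn, s.transpose(1, 2))  # (batch, T_t, hidden)
        # Pool over time and fuse the two modalities instead of concatenating raw vectors.
        fused = torch.cat([attended_speech.mean(dim=1), t.mean(dim=2)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = SpeechTextAttentionFusion()
    speech = torch.randn(2, 200, 40)   # 2 utterances, 200 frames, 40 speech features
    text = torch.randn(2, 30, 300)     # 2 utterances, 30 tokens, 300-d embeddings
    print(model(speech, text).shape)   # torch.Size([2, 6])
```

The point of the sketch is the contrast the abstract draws: instead of a plain concatenation of modality vectors, the attention weights let each text step select the relevant speech frames before the pooled representations are fused.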

Cited by 43 publications (19 citation statements); references 10 publications. Citing publications span 2019–2024.

Selected citation statements (ordered by relevance):
“…The model for audio-visual person separation can be classified under audio-visual sound separation systems (AV-SS) [6]. Such systems aim to extract one or multiple voice targets from a mixture of voice and visual information [20,21]. This information can then be used in different applications, such as focusing attention [22,23], establishing eye contact [24,25], and building a mutual understanding [26], etc.…”
Section: Current and Comparable Concepts Used For Audio-Visual System Perception And Contribution Of Presented Methodology (mentioning)
confidence: 99%
“…Differently, Zadeh et al. [43] present Graph-MFN, which synchronizes the multimodal sequences by storing intra-modality and cross-modality interactions through time with a graph structure. Attention mechanisms have been exploited by several works as well [44], [45], [41], [46], [47], [15], [17], [48], [21], [49]. For example, Dai et al. [21] present MESM, which is composed of a sparse cross-modal attention mechanism attached to the joint learning of multimodal features.…”
Section: Related Work (mentioning)
confidence: 99%
“…The weighted accuracy (WA) and unweighted accuracy (UA) of 64.08 and 56.41% were obtained from the IEMOCAP dataset, respectively. Lee et al (2018) proposed a model combining the convolutional neural network with the attention mechanism and the text data. The promising experimental result in the CMU-MOSEI database proved the effectiveness of the combination of the two modalities.…”
Section: Introduction (mentioning)
confidence: 99%
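The weighted accuracy (WA) and unweighted accuracy (UA) quoted above are standard metrics in speech emotion recognition: WA is the overall accuracy across all samples, while UA is the mean of the per-class recalls, so that small emotion classes count as much as large ones. A minimal sketch under those assumed definitions (the function names and toy labels are illustrative):

```python
# Minimal sketch of weighted accuracy (WA) and unweighted accuracy (UA),
# assuming the definitions commonly used in speech emotion recognition:
# WA = overall accuracy across all samples, UA = mean of per-class recalls.
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    per_class_correct = defaultdict(int)
    per_class_total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        per_class_total[t] += 1
        per_class_correct[t] += int(t == p)
    recalls = [per_class_correct[c] / per_class_total[c] for c in per_class_total]
    return sum(recalls) / len(recalls)


labels = ["angry", "happy", "sad", "sad", "neutral", "neutral"]
preds  = ["angry", "sad",   "sad", "sad", "neutral", "happy"]
print(f"WA = {weighted_accuracy(labels, preds):.3f}")    # 4/6 correct -> 0.667
print(f"UA = {unweighted_accuracy(labels, preds):.3f}")  # mean(1, 0, 1, 0.5) -> 0.625
```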