ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414880
Emotion Recognition by Fusing Time Synchronous and Time Asynchronous Representations

Abstract: In this paper, a novel two-branch neural network model structure is proposed for multimodal emotion recognition, which consists of a time synchronous branch (TSB) and a time asynchronous branch (TAB). To capture correlations between each word and its acoustic realisation, the TSB combines speech and text modalities at each input window frame and then uses pooling across time to form a single embedding vector. The TAB, by contrast, provides cross-utterance information by integrating sentence text embeddings fro…
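The two-branch structure described in the abstract can be sketched as follows. This is a minimal illustrative sketch only: the per-frame fusion by concatenation, the use of mean pooling in both branches, and all function names and shapes are assumptions, not the authors' exact implementation.

```python
# Hedged sketch of the TSB/TAB fusion from the abstract.
# Assumptions (not from the paper): concatenation as per-frame fusion,
# mean pooling over time and over sentence embeddings.

def tsb_embedding(speech_frames, text_frames):
    """Time synchronous branch: fuse speech and text features at each
    input frame, then pool across time into a single embedding vector."""
    # Per-frame fusion: concatenate aligned speech and text features.
    fused = [s + t for s, t in zip(speech_frames, text_frames)]
    n, dim = len(fused), len(fused[0])
    # Pooling across time (mean pooling assumed here).
    return [sum(f[i] for f in fused) / n for i in range(dim)]

def tab_embedding(sentence_embeddings):
    """Time asynchronous branch: aggregate sentence-level text embeddings
    (e.g. from surrounding utterances) into one vector."""
    n, dim = len(sentence_embeddings), len(sentence_embeddings[0])
    return [sum(e[i] for e in sentence_embeddings) / n for i in range(dim)]

def fuse_branches(speech_frames, text_frames, sentence_embeddings):
    """Concatenate the two branch embeddings before the final classifier."""
    return (tsb_embedding(speech_frames, text_frames)
            + tab_embedding(sentence_embeddings))
```

For example, with two frames of 2-dim speech features, 1-dim text features, and two 2-dim sentence embeddings, `fuse_branches` returns a single 5-dim utterance representation that a downstream emotion classifier would consume.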

Cited by 46 publications (23 citation statements)
References 26 publications
“…The results and modalities used in previous related work are summarised and compared in Table 4, which shows that our 4-way classification system achieved state-of-the-art results on IEMOCAP when evaluated with all three test settings. More detailed experiments and results can be found in [16].…”
Section: 4-Way Classification and Cross Comparisons (mentioning, confidence: 99%)
“…To evaluate the proposed approach, a state-of-the-art neural network model architecture proposed in [16] is adopted for emotion classification. It consists of a time synchronous branch (TSB) that focuses on modelling the temporal correlations of multimodal features and a time asynchronous branch (TAB) that takes sentence embeddings as input to facilitate the use of the semantic meaning embedded in the text transcriptions. Experimental results on the widely used IEMOCAP dataset [8] show that the TSB-TAB structure achieves state-of-the-art classification results in 4-way classification (happy, sad, angry & neutral) when evaluated with all of the commonly used speaker-independent test setups.…”
Section: Introduction (mentioning, confidence: 99%)
“…Performance comparison with 5-fold leave-one-session-out [12,7,13] and 10-fold leave-one-speaker-out [14,15,16,17,18,5] cross-validation strategies on IEMOCAP.…”
Section: Co-attention-based Fusion (mentioning, confidence: 99%)
“…Performance comparison (WA / UA, %) on IEMOCAP:

Model                         WA          UA
CNN-ELM+STC attention [12]    61.32       60.43
Audio 25 [7]                  60.64±1.96  61.32±2.26
IS09-classification [13]      68.1        63.8
Ours                          69.80       71.05
RNN(prop.)-ELM [14]           62.85       63.89
3D ACRNN [15]                 -           64.74±5.44
BLSTM-CTC-CA [16]             69.0        67.0
CNN GRU-SeqCap [17]           72.73       59.71
CNN TF Att.pooling [18]       71.75       68.06
HNSD [5]                      70.5        72.5
Ours                          71.64       72.70…”
Section: Co-attention-based Fusion (mentioning, confidence: 99%)