Interspeech 2019
DOI: 10.21437/interspeech.2019-1149

Multi-Modal Learning for Speech Emotion Recognition: An Analysis and Comparison of ASR Outputs with Ground Truth Transcription

Abstract: In this paper we plan to leverage multi-modal learning and automated speech recognition (ASR) systems toward building a speech-only emotion recognition model. Previous studies have shown that emotion recognition models using only acoustic features do not perform satisfactorily in detecting valence level. Text analysis has been shown to be helpful for sentiment classification. We compared classification accuracies obtained from an audio-only model, a text-only model and a multi-modal system leveraging both by p…
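The abstract compares an audio-only model, a text-only model, and a multi-modal system that fuses both. Below is a minimal, hypothetical PyTorch sketch of such a late-fusion setup; the feature dimensions, layer sizes, and concatenation-based fusion are illustrative assumptions, not the authors' reported architecture.

```python
# Hypothetical late-fusion sketch: combine an utterance-level acoustic
# feature vector with a pooled text embedding (e.g. from ASR output or a
# ground-truth transcript) and classify emotion categories.
import torch
import torch.nn as nn


class LateFusionSER(nn.Module):
    def __init__(self, acoustic_dim=88, text_dim=300, num_classes=4):
        super().__init__()
        # Separate encoders per modality (sizes are illustrative).
        self.acoustic_branch = nn.Sequential(nn.Linear(acoustic_dim, 128), nn.ReLU())
        self.text_branch = nn.Sequential(nn.Linear(text_dim, 128), nn.ReLU())
        # Fusion by concatenation, followed by a shared classifier head.
        self.classifier = nn.Linear(128 + 128, num_classes)

    def forward(self, acoustic_feats, text_feats):
        a = self.acoustic_branch(acoustic_feats)
        t = self.text_branch(text_feats)
        return self.classifier(torch.cat([a, t], dim=-1))


# Example forward pass with random stand-in features.
model = LateFusionSER()
acoustic = torch.randn(8, 88)   # e.g. utterance-level acoustic descriptors
text = torch.randn(8, 300)      # e.g. averaged word embeddings of ASR text
logits = model(acoustic, text)  # shape: (8, 4) emotion class scores
```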

Cited by 37 publications (30 citation statements). References 21 publications.
“…The feature set/modality that attained the highest performance in the categorical approach differs from that in the dimensional approach. In the categorical approach with the IEMOCAP dataset, word embeddings gave the highest performance in the unimodal model, as reported in [1,2,4,25]. In contrast, in the dimensional approach, acoustic features gave better average performance than text features.…”
Section: D) Discussion in Terms of Categorical Emotions (mentioning)
confidence: 77%
“…In [1,23], the authors used different deep learning architectures to predict categorical emotion from both speech and text. Some authors used phonemes instead of text for predicting emotion category, such as in [3,24], and another compared text features from ASR with manual transcriptions to investigate the effectiveness of their combination with acoustic features for categorical emotion recognition [25]. Those studies, although they used audio and text features, only predicted categorical emotion.…”
Section: I. Related Work (mentioning)
confidence: 99%
“…We observe that the proposed model achieves better performance and that data augmentation helps to improve robustness. We also compare our results with previous studies [36,41] in the cross-corpus setting in Table 4. In [36], the authors employ a multi-task framework and exploit larger unlabelled data for the auxiliary task to improve the generalisation of the model.…”
Section: Cross-corpus Settings (mentioning)
confidence: 79%
“…In [36], the authors employ a multi-task framework and exploit larger unlabelled data for the auxiliary task to improve the generalisation of the model. In [41], the authors develop a multi-modal technique (audio plus text) for SER based on ASR transcriptions. They demonstrate that the generalisability of ASR models helps to improve the generalisation of emotion classification models.…”
Section: Cross-corpus Settings (mentioning)
confidence: 99%