2021
DOI: 10.48550/arxiv.2110.15684
Preprint

Fusing ASR Outputs in Joint Training for Speech Emotion Recognition

Abstract: Alongside acoustic information, linguistic features based on speech transcripts have been proven useful in Speech Emotion Recognition (SER). However, due to the scarcity of emotion-labelled data and the difficulty of recognizing emotional speech, it is hard to obtain reliable linguistic features and models in this research area. In this paper, we propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for joint training SER. The relationship between ASR and SER is understudied, and it is u…
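The abstract describes fusing ASR outputs into a joint training pipeline for SER, but the full architecture is not reproduced on this page. The snippet below is a minimal PyTorch sketch of one plausible fusion scheme: concatenating a pooled acoustic-encoder state with an encoder over ASR hypothesis tokens before emotion classification. All module names, dimensions, and the late-fusion choice are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch: fusing ASR hypothesis text with acoustic features for SER.
# The fusion scheme, dimensions, and names are assumptions for illustration only.
import torch
import torch.nn as nn

class FusedSERModel(nn.Module):
    def __init__(self, n_acoustic=40, vocab_size=5000, hidden=128, n_emotions=4):
        super().__init__()
        # Acoustic branch: GRU over frame-level features (e.g. log-mel filterbanks).
        self.acoustic_rnn = nn.GRU(n_acoustic, hidden, batch_first=True)
        # Linguistic branch: embeddings over ASR-decoded token ids.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.text_rnn = nn.GRU(hidden, hidden, batch_first=True)
        # Late fusion by concatenation, followed by an emotion classifier.
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, acoustic_feats, asr_token_ids):
        _, h_a = self.acoustic_rnn(acoustic_feats)      # (1, B, hidden)
        _, h_t = self.text_rnn(self.embed(asr_token_ids))
        fused = torch.cat([h_a[-1], h_t[-1]], dim=-1)   # (B, 2*hidden)
        return self.classifier(fused)

# Toy joint-training step with random data in place of a real SER corpus.
model = FusedSERModel()
feats = torch.randn(8, 300, 40)            # 8 utterances, 300 acoustic frames each
tokens = torch.randint(0, 5000, (8, 25))   # ASR hypotheses, 25 tokens each
labels = torch.randint(0, 4, (8,))
loss = nn.CrossEntropyLoss()(model(feats, tokens), labels)
loss.backward()
```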

Cited by 2 publications (2 citation statements)
References: 16 publications
“…Research on emotion recognition is becoming more and more popular, such as speech-based emotion recognition [21], text-based emotion recognition [22], multimodal emotion recognition [23,24] and action-based emotion recognition [25]. Previous studies have revealed that emotion can be reflected from action.…”
Section: Introduction (mentioning)
confidence: 99%
“…[4] evaluated several handcrafted feature sets designed for different computational paralinguistics tasks and proposed a novel Active Data Representation method using acoustic features of all speech segments with a single fixed-dimension feature vector. Inspired by successful end-to-end approaches in speech and emotion recognition [19], [20], some recent works extract features directly from log-mel spectrograms instead of using handcrafted ones, resulting in better performance [21], [22].…”
Section: Related Work, A. AD Detection From Speech (mentioning)
confidence: 99%
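The statement above mentions extracting features directly from log-mel spectrograms rather than handcrafted feature sets. Below is a minimal sketch of such feature extraction using torchaudio; the sample rate, FFT size, hop length, and number of mel bins are illustrative assumptions, not values taken from the cited works.

```python
# Hypothetical sketch: log-mel spectrogram extraction for an end-to-end encoder.
# Parameter values are assumptions chosen only for illustration.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40
)
waveform = torch.randn(1, 16000)            # stand-in for a 1-second utterance
log_mel = torch.log(mel(waveform) + 1e-6)   # (1, 40, frames), fed to the encoder
```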