Interspeech 2018
DOI: 10.21437/interspeech.2018-2397

Predicting Arousal and Valence from Waveforms and Spectrograms Using Deep Neural Networks

Abstract: Automatic recognition of spontaneous emotion in conversational speech is an important yet challenging problem. In this paper, we propose a deep neural network model to track continuous emotion changes in the arousal-valence two-dimensional space by combining inputs from raw waveform signals and spectrograms, both of which have been shown to be useful in the emotion recognition task. The neural network architecture contains a set of convolutional neural network (CNN) layers and bidirectional long short-term mem…

Cited by 37 publications (15 citation statements)
References 31 publications
“…While these CCC scores are lower than results reported elsewhere [41] for the SEMAINE dataset (0.680 and 0.506 CCC for A and V, respectively) and the RECOLA dataset (0.692 and 0.423), we note that those datasets consist of recordings taken in carefully controlled conditions and annotated by groups of 6+ annotators, which ensured a level of uniformity unavailable in real-life, self-rated data.…”
Section: Results (contrasting)
confidence: 79%
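The statement above compares results by CCC (concordance correlation coefficient). As a rough illustration of what that metric measures, here is a minimal sketch using the standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population variances. The function name `concordance_cc` and the toy values are assumptions made for this example; none of this is taken from the cited papers' code.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    Hypothetical helper for illustration only. Uses population (biased)
    variance and covariance, as in the standard CCC definition.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and the same length")

    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and equal; CCC is undefined here.
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denom


# Toy arousal traces (made-up numbers): the prediction follows the
# annotation's shape exactly but sits one unit too low.
gold = [2.0, 3.0, 4.0]
pred = [1.0, 2.0, 3.0]
shifted_ccc = concordance_cc(pred, gold)
```

In this toy case the Pearson correlation would be 1, but the CCC comes out to 4/7 because the constant offset is penalized. That sensitivity to bias and scale is one reason CCC is commonly reported for continuous arousal and valence prediction.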
“…They managed to improve the valence prediction task using information from other modalities such as video and physiological signals. The work in [3] shows similar results on a couple of databases after extracting features from raw waveform and spectrogram using a convolutional neural network and passing them through a neural network based regressor to get the predicted arousal and valence scores. In [4], the authors employed a fuzzy inference based system and their results show a lower mean absolute error and a higher CCC in predicting arousal than valence across three different languages.…”
Section: Introduction (mentioning)
confidence: 70%
“…On the other hand, the recent advancements in deep learning, along with the available computational capabilities, have enabled the research community to build end-to-end systems for SER. A big advantage of such systems is that they can directly learn the features from spectrograms or raw waveforms [12,23,36,41,45], thereby obviating the need for extracting a large set of hand-crafted features [13]. Recent studies have proposed the use of convolutional neural network (CNN) models combined with long short-term memory (LSTM) built on spectrograms and raw waveforms, showing improved SER performance [19,23,24,26,35,36,46].…”
Section: Introduction (mentioning)
confidence: 99%