ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054629

Speech Emotion Recognition with Dual-Sequence LSTM Architecture

Abstract: Speech Emotion Recognition (SER) has emerged as a critical component of the next generation of human-machine interfacing technologies. In this work, we propose a new dual-level model that combines handcrafted and raw features for audio signals. Each utterance is preprocessed into a handcrafted input and two mel-spectrograms at different time-frequency resolutions. An LSTM processes the handcrafted input, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrogram…
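The preprocessing step named in the abstract, producing two mel-spectrograms of the same utterance at different time-frequency resolutions, can be sketched as follows. This is a minimal illustration assuming librosa; the FFT sizes, hop lengths, and mel-band count are hypothetical choices, not the paper's reported settings.

```python
# Minimal sketch of dual-resolution mel-spectrogram preprocessing,
# assuming librosa. All window/hop parameters below are illustrative
# assumptions, not the paper's exact configuration.
import librosa
import numpy as np

def dual_mel_spectrograms(wav_path, sr=16000, n_mels=64):
    y, sr = librosa.load(wav_path, sr=sr)
    # Wide analysis window: finer frequency resolution, coarser time resolution.
    mel_wide = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=n_mels)
    # Narrow analysis window: coarser frequency resolution, finer time resolution.
    mel_narrow = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, hop_length=128, n_mels=n_mels)
    # Log-compress, as is standard for mel-spectrogram network inputs.
    return (librosa.power_to_db(mel_wide, ref=np.max),
            librosa.power_to_db(mel_narrow, ref=np.max))
```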

Cited by 109 publications (54 citation statements)
References 16 publications
“…The authors employ Low-Level Descriptors (LLDs) extracted from the audio signal as input. Wang et al. (2020) propose a model consisting of two jointly trained LSTMs, used separately to process MFCC features and mel-spectrograms. Both models predict an output class (emotion), and the two predictions are averaged to arrive at the final result.…”
Section: Related Work (mentioning)
confidence: 99%
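The two-branch design this statement describes can be sketched in PyTorch. Layer sizes, feature dimensions, and the averaging of softmax outputs are illustrative assumptions, not the cited model's exact configuration.

```python
# Sketch of two jointly trained LSTM branches (MFCC and mel-spectrogram)
# whose class predictions are averaged, per the citation statement above.
# All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class DualBranchSER(nn.Module):
    def __init__(self, mfcc_dim=39, mel_dim=64, hidden=128, n_classes=4):
        super().__init__()
        self.mfcc_lstm = nn.LSTM(mfcc_dim, hidden, batch_first=True)
        self.mel_lstm = nn.LSTM(mel_dim, hidden, batch_first=True)
        self.mfcc_head = nn.Linear(hidden, n_classes)
        self.mel_head = nn.Linear(hidden, n_classes)

    def forward(self, mfcc_seq, mel_seq):
        # Use the final hidden state of each branch as its utterance summary.
        _, (h_mfcc, _) = self.mfcc_lstm(mfcc_seq)
        _, (h_mel, _) = self.mel_lstm(mel_seq)
        probs_mfcc = self.mfcc_head(h_mfcc[-1]).softmax(dim=-1)
        probs_mel = self.mel_head(h_mel[-1]).softmax(dim=-1)
        # Average the two branch predictions to obtain the final output.
        return (probs_mfcc + probs_mel) / 2
```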
“…We used the IEMOCAP dataset [29], a benchmark dataset containing 12 hours of speech from 10 professional actors. Following the literature [30,31,32,33], we extracted 5531 utterances of four emotion types from the dataset: 1636 happy (also including excited), 1084 sad, 1103 angry, and 1708 neutral. The utterances were force-aligned using the P2FA forced aligner.…”
Section: Data (mentioning)
confidence: 99%
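The four-class selection described in this statement, with excited merged into happy, amounts to a simple label filter. A sketch follows; the short label strings ('hap', 'exc', ...) are assumed annotation names and have not been verified against the IEMOCAP release.

```python
# Sketch of the four-class label selection: keep happy (merging excited),
# sad, angry, and neutral utterances. The label strings are assumptions
# about the dataset's annotation format.
LABEL_MAP = {
    "hap": "happy", "exc": "happy",  # excited merged into happy
    "sad": "sad",
    "ang": "angry",
    "neu": "neutral",
}

def filter_utterances(utterances):
    """Keep only the four emotion classes used in the experiments.

    `utterances` is assumed to be an iterable of (utt_id, label) pairs.
    """
    return [(utt_id, LABEL_MAP[label])
            for utt_id, label in utterances
            if label in LABEL_MAP]
```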
“…As phishing URLs comprise sequences of characters and words, existing CNN and LSTM networks are known to be amenable methods for URL feature extraction. The combination of CNN and LSTM for estimating time-invariant filter coefficients is already widely used in the field of spatiotemporal feature modeling [13], [14].…”
Section: Triplet Network for Embedding of Phishing Attacks (mentioning)
confidence: 99%
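The CNN-plus-LSTM combination this statement refers to, applied to a character-level URL sequence, might look like the following PyTorch sketch; the vocabulary size, kernel width, and layer dimensions are illustrative assumptions.

```python
# Sketch of a CNN + LSTM encoder for character-level URL sequences:
# a 1-D convolution extracts local character n-gram features from the
# embedded URL, and an LSTM models their order. Dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn

class CnnLstmUrlEncoder(nn.Module):
    def __init__(self, vocab_size=128, embed_dim=32, conv_ch=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, conv_ch, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_ch, hidden, batch_first=True)

    def forward(self, char_ids):          # char_ids: (batch, seq_len)
        x = self.embed(char_ids)          # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)             # Conv1d expects (batch, ch, seq)
        x = torch.relu(self.conv(x))
        x = x.transpose(1, 2)             # back to (batch, seq, ch)
        _, (h, _) = self.lstm(x)
        return h[-1]                      # fixed-size URL embedding
```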