2021
DOI: 10.20944/preprints202108.0433.v1
Preprint

A Novel Heterogeneous Parallel Convolution Bi-LSTM for Speech Emotion Recognition

Abstract: Speech emotion recognition remains a demanding task in natural language processing. It places strict requirements on the effectiveness of both feature extraction and the acoustic model. With that in mind, a Heterogeneous Parallel Convolution Bi-LSTM model is proposed to address these challenges. It consists of two heterogeneous branches: the left one contains two dense layers and a Bi-LSTM layer, while the right one contains a dense layer, a convolution layer, and a Bi-LSTM layer. It can exploit the spatiotemporal …
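To make the two-branch layout in the abstract concrete, here is a minimal NumPy sketch of a forward pass. Only the layer order in each branch comes from the abstract. Everything else is an assumption made for illustration: the layer sizes, the ReLU activations, taking the final hidden state of each Bi-LSTM, merging the branches by concatenation, and the softmax classifier on top. The abstract is truncated before it describes any of these, so this should not be read as the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: input features, dense width, LSTM units, kernel width, classes.
D_IN, D_HID, H, K, N_CLASSES = 40, 32, 16, 3, 7

def init(*shape):
    return rng.normal(scale=0.1, size=shape)

def lstm_params(d_in, h):
    # Input weights, recurrent weights, bias for the four gates (i, f, g, o).
    return init(d_in, 4 * h), init(h, 4 * h), np.zeros(4 * h)

def dense(x, w, b):
    """Frame-wise dense layer with ReLU (assumed): (T, d_in) -> (T, d_out)."""
    return np.maximum(x @ w + b, 0.0)

def conv1d_same(x, kernels, b):
    """1-D convolution over time, zero 'same' padding, ReLU (assumed).
    x: (T, d_in), kernels: (k, d_in, d_out) with odd k -> (T, d_out)."""
    k = kernels.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([np.einsum("kd,kdo->o", xp[t:t + k], kernels)
                    for t in range(x.shape[0])])
    return np.maximum(out + b, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_state(x, w, u, b):
    """Run one LSTM direction over x (T, d_in); return the final hidden state."""
    h = np.zeros(u.shape[0])
    c = np.zeros(u.shape[0])
    for x_t in x:
        i, f, g, o = np.split(x_t @ w + h @ u + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def bilstm(x, fwd, bwd):
    """Concatenate final states of a forward pass and a time-reversed pass."""
    return np.concatenate([lstm_last_state(x, *fwd),
                           lstm_last_state(x[::-1], *bwd)])

params = {
    "left_d1": (init(D_IN, D_HID), np.zeros(D_HID)),
    "left_d2": (init(D_HID, D_HID), np.zeros(D_HID)),
    "left_fwd": lstm_params(D_HID, H),
    "left_bwd": lstm_params(D_HID, H),
    "right_d1": (init(D_IN, D_HID), np.zeros(D_HID)),
    "right_conv": (init(K, D_HID, D_HID), np.zeros(D_HID)),
    "right_fwd": lstm_params(D_HID, H),
    "right_bwd": lstm_params(D_HID, H),
    "out": (init(4 * H, N_CLASSES), np.zeros(N_CLASSES)),
}

def forward(x, p):
    # Left branch: dense -> dense -> Bi-LSTM (order taken from the abstract).
    left = dense(dense(x, *p["left_d1"]), *p["left_d2"])
    left = bilstm(left, p["left_fwd"], p["left_bwd"])
    # Right branch: dense -> convolution -> Bi-LSTM (order taken from the abstract).
    right = conv1d_same(dense(x, *p["right_d1"]), *p["right_conv"])
    right = bilstm(right, p["right_fwd"], p["right_bwd"])
    # Assumption: branches are merged by concatenation, then softmax-classified.
    logits = np.concatenate([left, right]) @ p["out"][0] + p["out"][1]
    e = np.exp(logits - logits.max())
    return e / e.sum()

# One utterance of 50 frames with random (untrained) weights.
probs = forward(rng.normal(size=(50, D_IN)), params)
```

With untrained weights the output is only a shape check: `probs` is a length-7 probability vector, one entry per hypothetical emotion class.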

Citations: cited by 12 publications (4 citation statements)
References: 25 publications
“…However, the proposed ESN model is able to outperform these deep learning models by achieving 84.80% UA. Authors in [13] adopted the Heterogeneous Parallel Convolution Bi-LSTM model and applied speaker-independent for SAVEE dataset and they achieved 56.5% of UA, and Random Deep Belief Networks model [44] performed 53.60% UA for SAVEE, however, our method obtained 65.95% UA. Unlike EMODB and SAVEE, one can notice the big challenge to gain higher accuracy for the Aibo dataset.…”
Section: Comparison With the State-of-the-art (mentioning)
confidence: 84%
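The statement above compares systems by UA. In speech emotion recognition, UA is commonly read as unweighted accuracy, meaning the average of per-class recalls, so that rare emotions count as much as frequent ones. The citing paper's exact definition is not shown here, so the sketch below assumes that common reading. The labels and predictions are made up for illustration.

```python
from collections import defaultdict

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally regardless of size."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

def weighted_accuracy(y_true, y_pred):
    """Plain accuracy: fraction of all utterances classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up, imbalanced example: 8 neutral utterances, 2 angry ones.
y_true = ["neutral"] * 8 + ["anger"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "anger"]

ua = unweighted_accuracy(y_true, y_pred)  # (8/8 + 1/2) / 2 = 0.75
wa = weighted_accuracy(y_true, y_pred)    # 9/10 = 0.9
```

The gap between the two values shows why UA is the stricter figure on imbalanced corpora: missing half of the minority class barely moves plain accuracy but lowers UA substantially.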
“…Typical approaches for supporting temporal data are LSTM and ESN. Authors in [13] used Bi-LSTM deep learning with two heterogeneous branches where the left side has two dense layers and the right side has a convolution layer. Additionally, the handcrafted time-series features with 512 frames are used in [14] for feeding CNN and Bi-LSTM model.…”
Section: Literature Review (mentioning)
confidence: 99%
“…Reference [25] adopted the spectrogram, fused it with the convolutional neural network (CNN) and attention mechanism, and used two different convolution kernels to extract time-domain features and frequency-domain features respectively, and saved the spectrogram as an image directly. After normalization processing, the accuracy of speech emotion evaluation is high.…”
Section: Spectrogram (mentioning)
confidence: 99%
“…Mei Wang, et al [6] fused electroencephalographs (EEGs) and facial expression information by using maximum weight multimodal fusion in decision level fusion. However, a heterogeneity gap exists [7] between the different modalities and some researchers have neglected the connection between them [8]. We need to narrow the heterogeneity gaps and make good use of the connections between the different modalities.…”
Section: I (mentioning)
confidence: 99%