Interspeech 2019
DOI: 10.21437/interspeech.2019-3252

Direct Modelling of Speech Emotion from Raw Speech

Abstract: Speech emotion recognition is a challenging task and heavily depends on hand-engineered acoustic features, which are typically crafted to echo human perception of speech signals. However, a filter bank that is designed from perceptual evidence is not always guaranteed to be the best in a statistical modelling framework where the end goal is, for example, emotion classification. This has fuelled the emerging trend of learning representations from raw speech, especially using deep learning neural networks. In particular,…

Cited by 89 publications (60 citation statements)
References 38 publications
“…Through the improvement of technologies, artificial intelligence and CNNs are the most popular sources that have achieved excessive success in many fields, such as handwriting recognition [28], object recognition [23], natural language processing [29, 30], and SER [31]. The convolutional neural networks addressed the scalability issues of the traditional neural networks [32, 33] by allowing them to share similar weights for multiple regions of the inputs [34]. Usually, the CNN model consists of three main building blocks that first include the convolution layers, second the pooling layers, and finally the fully connected layers.…”
Section: Methods (mentioning)
confidence: 99%
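
The three building blocks named in the statement above (convolution, pooling, fully connected) can be sketched in a few lines of plain Python. This is only an illustrative toy, not the architecture of the cited paper or of the citing work: the function names, the example signal, the kernel, and the dense weights are all assumptions chosen to make the weight sharing visible.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation form).

    The same kernel weights are reused at every position of the input,
    which is the weight sharing the statement refers to.
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(xs):
    """Element-wise rectified linear activation."""
    return [max(0.0, x) for x in xs]


def max_pool1d(xs, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]


def dense(xs, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, xs)) + b for row, b in zip(weights, bias)]


# Hypothetical toy input and parameters (not taken from any cited paper).
signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0]
kernel = [1.0, 0.0, -1.0]

features = max_pool1d(relu(conv1d(signal, kernel)), size=2)
scores = dense(
    features,
    weights=[[1.0] * len(features), [-1.0] * len(features)],
    bias=[0.0, 0.0],
)
```

A real system would stack several such layers and learn the kernel and dense weights from data; here they are fixed by hand purely to show the data flow.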
“…By taking into account both the amount of training data and the network complexity, it is understandable that the segment duration of 250 ms turned out to be the best choice in our search for the optimal segment duration for the end-to-end systems. The method used in this work for choosing the optimal segment duration has also been adopted in [66] and [67].…”
Section: Pathological Voice Detection Using An End-to-end System (mentioning)
confidence: 99%
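
As a rough illustration of the fixed-duration segmentation discussed in the statement above, the sketch below cuts a raw waveform into non-overlapping 250 ms segments. The 16 kHz sample rate, the function name, and the choice to drop a trailing partial segment are assumptions made for this example; the quoted work does not specify them here.

```python
def split_into_segments(samples, sample_rate_hz, segment_ms=250):
    """Split raw audio samples into fixed-length, non-overlapping segments.

    Trailing samples that do not fill a whole segment are dropped
    (an assumption; padding would be another option).
    """
    segment_len = int(sample_rate_hz * segment_ms / 1000)
    if segment_len <= 0:
        raise ValueError("segment length must be at least one sample")
    n_full = len(samples) // segment_len
    return [samples[i * segment_len:(i + 1) * segment_len] for i in range(n_full)]


# One second of silent dummy audio at an assumed 16 kHz sample rate.
waveform = [0.0] * 16000
segments = split_into_segments(waveform, sample_rate_hz=16000)
```

Shorter segments give more training examples per recording but less context per example, which is the trade-off between data volume and network complexity that the statement describes.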
“…The related works in [6][7][8][9] proposed different mechanism to improve the performance of speech emotion recognition in normal environment. Speech emotion recognition system using CNN with the improvement of CapsNets are proposed in [6] by using IEMOCAP dataset and proved that CapsNets get the better performance than baseline CNNs in building the recognition model.…”
Section: Related Work (mentioning)
confidence: 99%
“…Speech emotion recognition system using CNN with the improvement of CapsNets are proposed in [6] by using IEMOCAP dataset and proved that CapsNets get the better performance than baseline CNNs in building the recognition model. The groups of [7] and [8] also used CNN based classifiers that leads to reliable improvements in accuracy of the speed emotion recognition model and two emotion dataset of IEMOCAP and MSP-IMPROV for unbalanced speed with unsupervised learning and for raw speed. The system used the Bag-of-Visual Words as the classification model on Audio Segment Spectrograms is proposed by the groups [9].…”
Section: Related Work (mentioning)
confidence: 99%