“…On the other hand, recent advances in deep learning, together with the computational capabilities now available, have enabled the research community to build end-to-end systems for SER. A major advantage of such systems is that they learn features directly from spectrograms or raw waveforms [12,23,36,41,45], obviating the need to extract a large set of hand-crafted features [13]. Recent studies have proposed convolutional neural network (CNN) models combined with long short-term memory (LSTM) networks, operating on spectrograms and raw waveforms, and have reported improved SER performance [19,23,24,26,35,36,46].…”