Speech emotion recognition (SER) is a difficult and challenging task because of the affective variances between different speakers. The performances of SER are extremely reliant on the extracted features from speech signals. To establish an effective features extracting and classification model is still a challenging task. In this paper, we propose a new method for SER based on Deep Convolution Neural Network (DCNN) and Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data enhancement and datasets balancing. Secondly, we extract three-channel of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. Then the DCNN model pre-trained on ImageNet dataset is applied to generate the segment-level features. We stack these features of a sentence into utterance-level features. Next, we adopt BLSTM to learn the high-level emotional features for temporal summarization, followed by an attention layer which can focus on emotionally relevant features. Finally, the learned high-level emotional features are fed into the Deep Neural Network (DNN) to predict the final emotion. Experiments on EMO-DB and IEMOCAP database obtain the unweighted average recall (UAR) of 87.86 and 68.50%, respectively, which are better than most popular SER methods and demonstrate the effectiveness of our propose method.
The human visual system (HVS), affected by viewing distance when perceiving the stereo image information, is of great significance to study of stereoscopic image quality assessment. Many methods of stereoscopic image quality assessment do not have comprehensive consideration for human visual perception characteristics. In accordance with this, we propose a Rich Structural Index (RSI) for Stereoscopic Image objective Quality Assessment (SIQA) method based on multi-scale perception characteristics. To begin with, we put the stereo pair into the image pyramid based on Contrast Sensitivity Function (CSF) to obtain sensitive images of different resolution . Then, we obtain local Luminance and Structural Index (LSI) in a locally adaptive manner on gradient maps which consider the luminance masking and contrast masking. At the same time we use Singular Value Decomposition (SVD) to obtain the Sharpness and Intrinsic Structural Index (SISI) to effectively capture the changes introduced in the image (due to distortion). Meanwhile, considering the disparity edge structures, we use gradient cross-mapping algorithm to obtain Depth Texture Structural Index (DTSI). After that, we apply the standard deviation method for the above results to obtain contrast index of reference and distortion components. Finally, for the loss caused by the randomness of the parameters, we use Support Vector Machine Regression based on Genetic Algorithm (GA-SVR) training to obtain the final quality score. We conducted a comprehensive evaluation with state-of-the-art methods on four open databases. The experimental results show that the proposed method has stable performance and strong competitive advantage.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.