Because contextual information strongly influences a speaker's emotional state, how to use emotion-related contextual information for feature learning is a key problem. Although existing speech emotion recognition algorithms achieve relatively high recognition rates, they do not transfer well to real-life speech emotion recognition systems. To address these issues, this paper proposes a novel speech emotion recognition algorithm based on an improved stacked kernel sparse deep model, which builds on the auto-encoder, the denoising auto-encoder, and the sparse auto-encoder to improve Chinese speech emotion recognition. The first layer of the structure uses a denoising auto-encoder to learn hidden features whose dimension is larger than that of the input features, and the second layer employs a sparse auto-encoder to learn sparse features. Finally, a wavelet-kernel sparse SVM classifier is applied to classify the features. The proposed algorithm is evaluated on a testing dataset that contains spontaneous, non-prototypical, and long-term speech emotion data. The experimental results show that the proposed algorithm outperforms existing state-of-the-art algorithms in speech emotion recognition.
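The two-layer feature-learning structure described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the masking-noise corruption, the KL-divergence sparsity penalty, and the Morlet-type wavelet kernel are all assumptions chosen because they are common choices for these components. The weights are randomly initialized, and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def corrupt(x, drop_prob, rng):
    """Masking noise (assumed corruption type): zero each entry with probability drop_prob."""
    mask = rng.random(x.shape) >= drop_prob
    return x * mask

def encode(x, W, b):
    """Encoder half of an auto-encoder: affine map followed by a sigmoid."""
    return sigmoid(x @ W + b)

def kl_sparsity(hidden, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(rho || mean activation), a common sparsity penalty."""
    rho_hat = np.clip(hidden.mean(axis=0), eps, 1 - eps)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return float(np.sum(kl))

def wavelet_kernel(x, y, a=1.0):
    """Morlet-type wavelet kernel (assumed form): product over dimensions of
    cos(1.75 u) * exp(-u^2 / 2) with u = (x - y) / a."""
    u = (x - y) / a
    return float(np.prod(np.cos(1.75 * u) * np.exp(-0.5 * u ** 2)))

# Hypothetical sizes: 20 input features, 40 hidden units in layer 1
# (larger than the input, as the text describes), 10 units in layer 2.
n_samples, n_in, n_h1, n_h2 = 8, 20, 40, 10
X = rng.random((n_samples, n_in))

# Untrained, randomly initialized weights (training is omitted in this sketch).
W1, b1 = rng.normal(0, 0.1, (n_in, n_h1)), np.zeros(n_h1)
W2, b2 = rng.normal(0, 0.1, (n_h1, n_h2)), np.zeros(n_h2)

# Layer 1 (denoising): during training the encoder sees corrupted input
# and the decoder would be asked to reconstruct the clean X.
H1_noisy = encode(corrupt(X, 0.3, rng), W1, b1)

# At feature-extraction time the clean input is encoded.
H1 = encode(X, W1, b1)

# Layer 2 (sparse): encode layer-1 features; the penalty would be added to the loss.
H2 = encode(H1, W2, b2)
penalty = kl_sparsity(H2)

# Kernel value between a feature vector and itself.
k_self = wavelet_kernel(H2[0], H2[0])
```

In a full pipeline, the layer-2 features would be passed to an SVM that uses the wavelet kernel; a kernel matrix built from `wavelet_kernel` could be supplied to any SVM solver that accepts precomputed kernels.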