In recent years, the growing popularity of smart mobile devices has made interaction between devices and users, particularly voice interaction, increasingly important. By enabling smart devices to better understand users' emotional states from voice data, more personalized services become possible. This paper proposes a machine learning model for speech emotion recognition called CLDNN, which combines convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and deep neural networks (DNNs). To design a system that recognizes audio signals in a manner similar to the human auditory system, the model takes the Mel-frequency cepstral coefficients (MFCCs) of the audio data as its input. First, the MFCCs of the speech signal are extracted. Local feature learning blocks (LFLBs), composed of one-dimensional CNNs, then compute local features from these coefficients. Because audio signals are time-series data, the features produced by the LFLBs are fed into an LSTM layer to capture temporal dependencies. Finally, fully connected layers perform classification and prediction. The proposed model is evaluated on three databases: RAVDESS, EMO-DB, and IEMOCAP. The results show that the LSTM effectively models the features extracted by the 1D CNN, owing to the time-series nature of speech signals. In addition, the data augmentation method applied in this paper improves the recognition accuracy and stability of the system across the different databases. Furthermore, the experimental results show that the proposed system achieves higher recognition rates than related work in speech emotion recognition.
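To make the pipeline described above concrete, the following is a minimal PyTorch sketch of a CLDNN-style model: stacked LFLBs (1D convolution, batch normalization, ReLU, max pooling) over MFCC frames, followed by an LSTM layer and fully connected classification layers. It is not the authors' implementation; the hyperparameters (40 MFCCs, two LFLBs, a 128-unit LSTM, and 8 emotion classes as in RAVDESS) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LFLB(nn.Module):
    """Local feature learning block: 1D conv -> batch norm -> ReLU -> max pooling.
    (Illustrative structure; layer sizes are assumptions, not the paper's exact values.)"""
    def __init__(self, in_channels, out_channels, kernel_size=3, pool_size=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
            nn.MaxPool1d(pool_size),
        )

    def forward(self, x):
        return self.block(x)


class CLDNN(nn.Module):
    """CNN (LFLBs) + LSTM + fully connected layers over MFCC input (batch, n_mfcc, time)."""
    def __init__(self, n_mfcc=40, n_classes=8, hidden=128):
        super().__init__()
        self.lflbs = nn.Sequential(
            LFLB(n_mfcc, 64),
            LFLB(64, 128),
        )
        self.lstm = nn.LSTM(input_size=128, hidden_size=hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        feats = self.lflbs(x)                  # (batch, 128, time') local features from 1D CNNs
        feats = feats.transpose(1, 2)          # (batch, time', 128) so the LSTM sees a time series
        out, _ = self.lstm(feats)              # temporal modelling of the CNN features
        return self.classifier(out[:, -1, :])  # last time step -> emotion logits


# Example: a batch of 4 utterances, each with 40 MFCC coefficients over 300 frames
logits = CLDNN()(torch.randn(4, 40, 300))
print(logits.shape)  # torch.Size([4, 8])
```

In this sketch the last LSTM output is used as the utterance-level representation before the fully connected layers; pooling over all time steps would be an equally reasonable choice under the same architecture.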