This study applies deep learning to facial expression recognition in order to improve recognition efficiency in real scenes. First, the theoretical model of the convolutional neural network (CNN) is introduced. On this basis, a new CNN model is proposed with six learning layers: three convolutional layers and three fully connected layers. Then, to address the shortcomings of this model, weight decay and dropout are introduced as optimization methods. Moreover, an attention module is designed based on the theory of the attention mechanism, and a hybrid attention module (channel-domain attention and spatial-domain attention) is added after each convolutional layer. Finally, the Facial Expression Recognition 2013 (FER2013) dataset and the extended Cohn–Kanade (CK+) dataset are used to test and compare the performance of the models. The results show that after 50 epochs, the accuracy of the original model fluctuates greatly during convergence; once the attention mechanism is added, this fluctuation is clearly reduced. On the CK+ dataset, the accuracy of each model stays at about 95%, while on the FER2013 dataset it is about 71%. With the attention mechanism, the recognition rate for dynamic nonlinear expressions is higher than that of the basic model. For dynamic nonlinear expressions with occlusion, the recognition rates of both the original model and the optimized model with the attention mechanism decrease to varying degrees, but the decrease is smaller for the latter.
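The hybrid attention idea described above (channel-domain attention followed by spatial-domain attention on a convolutional feature map) can be illustrated with a minimal sketch. This is not the model evaluated in this work: the gates here are parameter-free (a sigmoid of an average), whereas a trained module would normally learn its gating weights, and the function names `channel_attention`, `spatial_attention`, and `hybrid_attention` are hypothetical. The order of the two stages (channel first, then spatial) is also an assumption.

```python
import math


def sigmoid(x):
    """Logistic function, used here as the attention gate."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(fmap):
    """Scale each channel by a gate derived from its global average.

    fmap is a nested list indexed as fmap[channel][row][col].
    Assumption: the gate has no learned weights (illustration only).
    """
    gated = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        gate = sigmoid(sum(values) / len(values))
        gated.append([[v * gate for v in row] for row in channel])
    return gated


def spatial_attention(fmap):
    """Scale each spatial position by a gate derived from the channel mean."""
    n_channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    gate = [
        [
            sigmoid(sum(fmap[c][i][j] for c in range(n_channels)) / n_channels)
            for j in range(width)
        ]
        for i in range(height)
    ]
    return [
        [[fmap[c][i][j] * gate[i][j] for j in range(width)] for i in range(height)]
        for c in range(n_channels)
    ]


def hybrid_attention(fmap):
    """Apply channel attention, then spatial attention (assumed order)."""
    return spatial_attention(channel_attention(fmap))
```

In a network of the kind described, a function like `hybrid_attention` would be applied to the output of each of the three convolutional layers before the next layer. The output keeps the shape of the input, and because every gate lies between 0 and 1, each activation is scaled down in proportion to how weakly its channel and position respond.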