Emotion recognition is an important application of artificial intelligence and machine learning in education. By recognizing students' emotions in learning scenarios, teachers can better understand their learning states and provide personalized learning resources and support. Current emotion recognition methods are mainly based on static facial expressions and neglect the temporal dynamics of facial emotion, which can lead to inaccurate recognition results. To overcome these limitations, this study investigates emotion recognition and its application in learning scenarios supported by smart classrooms. A Transformer encoder is used to extract the temporal features of students' facial emotions: its self-attention module captures how expressions evolve over time within a learning scene. Residual attention networks, Transformers, and non-local neural networks are used to extract facial emotion features from different perspectives and at different levels. Combining the Vision Transformer (ViT) with NetVLAD enables the model to learn data features from multiple perspectives, thereby improving its generalization ability. Experimental results verify the effectiveness of the constructed model.
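
To make the temporal-modeling step concrete, the sketch below shows one plausible way a Transformer encoder's self-attention could aggregate per-frame facial-expression features into a clip-level emotion prediction. This is a minimal illustration under stated assumptions, not the paper's exact configuration: the class name `TemporalEmotionEncoder`, the feature dimension, layer counts, pooling choice, and the seven-class emotion output are all hypothetical.

```python
import torch
import torch.nn as nn

class TemporalEmotionEncoder(nn.Module):
    """Minimal sketch: a Transformer encoder over per-frame facial features.

    Assumes per-frame embeddings (e.g., from a CNN or ViT backbone) are
    already extracted; shapes and hyperparameters are illustrative only.
    """
    def __init__(self, feat_dim=512, n_heads=8, n_layers=4, n_classes=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, x):
        # x: (batch, frames, feat_dim) per-frame facial embeddings
        h = self.encoder(x)  # self-attention mixes context across frames
        # Mean-pool over the temporal axis, then classify the clip
        return self.classifier(h.mean(dim=1))

# Example: 8 clips, 16 frames each, 512-d per-frame features, 7 emotion classes
feats = torch.randn(8, 16, 512)
logits = TemporalEmotionEncoder()(feats)
print(logits.shape)  # torch.Size([8, 7])
```

The key design point this sketch illustrates is that self-attention lets every frame attend to every other frame, so the model can weight emotionally salient moments in a learning scene rather than treating each facial image in isolation.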