In smart classroom environments, accurately recognizing students’ facial expressions is crucial for teachers to efficiently assess students’ learning states, timely adjust teaching strategies, and enhance teaching quality and effectiveness. In this paper, we propose a student facial expression recognition model based on multi-scale and deep fine-grained feature attention enhancement (SFER-MDFAE) to address the issues of inaccurate facial feature extraction and poor robustness of facial expression recognition in smart classroom scenarios. Firstly, we construct a novel multi-scale dual-pooling feature aggregation module to capture and fuse facial information at different scales, thereby obtaining a comprehensive representation of key facial features; secondly, we design a key region-oriented attention mechanism to focus more on the nuances of facial expressions, further enhancing the representation of multi-scale deep fine-grained feature; finally, the fusion of multi-scale and deep fine-grained attention-enhanced features is used to obtain richer and more accurate facial key information and realize accurate facial expression recognition. The experimental results demonstrate that the proposed SFER-MDFAE outperforms the existing state-of-the-art methods, achieving an accuracy of 76.18% on FER2013, 92.75% on FERPlus, 92.93% on RAF-DB, 67.86% on AffectNet, and 93.74% on the real smart classroom facial expression dataset (SCFED). These results validate the effectiveness of the proposed method.