Detecting students’ classroom behaviors from instructional videos is important for instructional assessment, analyzing students’ learning status, and improving teaching quality. To achieve effective detection of student classroom behavior based on videos, this paper proposes a classroom behavior detection model based on the improved SlowFast. First, a Multi-scale Spatial-Temporal Attention (MSTA) module is added to SlowFast to improve the ability of the model to extract multi-scale spatial and temporal information in the feature maps. Second, Efficient Temporal Attention (ETA) is introduced to make the model more focused on the salient features of the behavior in the temporal domain. Finally, a spatio-temporal-oriented student classroom behavior dataset is constructed. The experimental results show that, compared with SlowFast, our proposed MSTA-SlowFast has a better detection performance with mean average precision (mAP) improvement of 5.63% on the self-made classroom behavior detection dataset.