Analyzing teacher and student behavior in physical education classrooms is an important way to improve teaching quality and teaching methods, because it helps teachers identify and fix gaps and raise their teaching level. To address the data differences between modalities and the conflicts between the feature extraction modules of different modalities, this paper designs HRformer, a Transformer-based dual-stream framework that unifies the skeletal and video modalities within a single algorithm. The relationship between the skeletal and video modalities is modeled with the self-attention mechanism, and skeletal features are matched and fused with video data to construct a multimodal behavior recognition model for teachers and students in the physical education classroom. The model is then compared with mainstream networks on the dataset to verify its performance. For model application and case analysis, data on teachers and students in physical education classrooms were collected at a university over one semester. The multimodal model achieves classification F1 values of 95.61%, 93.19%, and 93.74% for the three behavior types, namely skill training (ST), game activity (GA), and rest, respectively, all higher than those of the single skeletal-modality and single video-modality methods. The model's highest recognition accuracy is for game activity (GA), at 97.12% and 98.15%. On real physical education classroom data, the model performs well in practical application, and the behavior recognition and classification results are in line with design expectations. This study develops an effective method for classifying teacher and student behaviors in a physical education classroom. 
It also offers a useful exploration of the integration and innovation of physical education teaching and information technology.
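
The fusion idea described above, in which self-attention relates skeletal and video features, can be illustrated with a minimal NumPy sketch. This is not the HRformer implementation: the function name `fuse_modalities`, the token counts, the feature width, and the random projection matrices are all assumptions made for illustration, and a single attention head stands in for whatever architecture the paper actually uses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_modalities(skeleton_tokens, video_tokens, w_q, w_k, w_v):
    """Hypothetical single-head self-attention over both modalities.

    Tokens from the two streams are concatenated into one sequence, so
    every skeleton token can attend to every video token and vice versa.
    """
    tokens = np.concatenate([skeleton_tokens, video_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # scaled dot-product scores
    attn = softmax(scores, axis=-1)           # each row sums to 1
    fused = attn @ v                          # attention-weighted features
    return fused, attn

# Assumed toy sizes: 5 skeleton tokens, 8 video tokens, 16-dim features.
rng = np.random.default_rng(0)
d = 16
skeleton_tokens = rng.normal(size=(5, d))
video_tokens = rng.normal(size=(8, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = fuse_modalities(skeleton_tokens, video_tokens, w_q, w_k, w_v)
pooled = fused.mean(axis=0)  # clip-level feature a classifier head could consume
```

In this sketch, the block `attn[:5, 5:]` holds the weights with which skeleton tokens attend to video tokens, which is one simple way to inspect how the two modalities are matched before classification into behavior types such as ST, GA, and rest.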