Emotion recognition, a prominent topic in human-computer interaction, is now widely applied in education, where affective computing provides teachers with real-time feedback on their teaching. Body movement is one of the main channels through which humans express emotion. This paper proposes combining a temporal segment network (TSN) with a spatiotemporal graph convolutional network (STGCN) to build a dual-stream model that captures the spatiotemporal characteristics of body movement. The model was evaluated on 192 volunteers, and the TSN+STGCN model achieved the highest recognition rates under the NM (97.31%), BG (88.22%), and CL (78.95%) conditions, outperforming both the CNN and STGCN models. The viewing angle also affects recognition performance: the closer the angle is to 90°, the lower the recognition rate. In tests on six emotions, the TSN+STGCN model recognized anger and sadness without error; among the remaining emotions, happiness (98.18%) was recognized with the highest accuracy and shock (73.91%) with the lowest. In a classroom case study, most students exhibited positive emotions at a given moment, and the step-line chart of students' emotional fluctuations during the lesson gave teachers a basis for judging students' learning state in class.
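
To illustrate the dual-stream idea described above, the following is a minimal sketch, not the authors' implementation: a TSN-style appearance stream that averages per-segment class scores, an ST-GCN-style skeleton stream that combines a spatial graph convolution with a temporal convolution, and late fusion by weighted score averaging. The layer sizes, number of joints (V=18), number of segments (K=3), the identity placeholder adjacency matrix, and the fusion weight are all illustrative assumptions.

```python
# Minimal dual-stream (TSN-style + ST-GCN-style) sketch; sizes are assumptions.
import torch
import torch.nn as nn


class SegmentStream(nn.Module):
    """TSN-style stream: score each sampled segment, then average (consensus)."""

    def __init__(self, num_classes: int, num_segments: int = 3):
        super().__init__()
        self.num_segments = num_segments
        # Small per-frame CNN stands in for a full backbone (e.g. a ResNet).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):                            # x: (N, K, 3, H, W)
        n, k, c, h, w = x.shape
        scores = self.backbone(x.view(n * k, c, h, w))
        return scores.view(n, k, -1).mean(dim=1)     # segmental consensus


class GraphStream(nn.Module):
    """ST-GCN-style stream: spatial graph convolution + temporal convolution."""

    def __init__(self, num_classes: int, num_joints: int = 18, in_ch: int = 3):
        super().__init__()
        # Fixed, row-normalized adjacency; identity here as a placeholder skeleton.
        adj = torch.eye(num_joints)
        self.register_buffer("adj", adj / adj.sum(dim=1, keepdim=True))
        self.spatial = nn.Conv2d(in_ch, 32, kernel_size=1)              # per-joint transform
        self.temporal = nn.Conv2d(32, 32, kernel_size=(9, 1), padding=(4, 0))
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):                            # x: (N, C, T, V) joint coordinates
        x = self.spatial(x)                          # feature transform per joint
        x = torch.einsum("nctv,vw->nctw", x, self.adj)  # aggregate over the skeleton graph
        x = torch.relu(self.temporal(x))             # temporal convolution along T
        return self.head(x.mean(dim=(2, 3)))         # global pooling + classifier


class DualStreamEmotionNet(nn.Module):
    """Late fusion of the two streams by weighted score averaging."""

    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.rgb = SegmentStream(num_classes)
        self.skeleton = GraphStream(num_classes)

    def forward(self, frames, joints, w: float = 0.5):
        return w * self.rgb(frames) + (1 - w) * self.skeleton(joints)


if __name__ == "__main__":
    model = DualStreamEmotionNet(num_classes=6)      # six emotion classes, as in the study
    frames = torch.randn(2, 3, 3, 112, 112)          # (N, segments, C, H, W)
    joints = torch.randn(2, 3, 64, 18)               # (N, C=xyz, T, V joints)
    print(model(frames, joints).shape)               # torch.Size([2, 6])
```

In a real system the placeholder adjacency matrix would be replaced by the normalized skeleton graph of the pose estimator in use, and each stream's toy backbone by the networks the paper describes; only the fusion scheme (per-segment consensus plus weighted score averaging) is the point of the sketch.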