This study applies a deep learning model to aerobics movement recognition. Lightweight multi-scale convolution modules are embedded in a 3D convolutional residual network to enlarge the local receptive field of each layer; this significantly reduces model complexity while extracting fine-grained multi-scale features of the target, which significantly improves its characterization. A channel attention mechanism then extracts the key features from the multi-scale features. To create a dual-speed frame rate detection model, the fast-slow combination idea is incorporated into the 3D convolutional network: the model processes the video at different frame rates to obtain spatial semantic information and motion information, and the features of the two channels are fused by lateral concatenation. Once all features have been acquired, they are fed into a temporal detection network to identify temporal actions, and a behavior recognition system is built on the network model to demonstrate its applicability. The average scores of students in the experimental group were significantly higher than those in the control group in seven areas: set accuracy, movement amplitude, movement strength, body coordination, coordination of movement and music, movement expression, and aesthetics. The average scores for movement proficiency and body control were also higher in the experimental group than in the control group, but these differences were not significant. The differences between the eight indicators in the experimental group were not significant when compared to those in the preexperimental group, indicating that intensive rhythm training improves secondary school students' comprehension, proficiency, and presentation of aerobics sets.
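The dual-speed idea described above can be illustrated with a minimal sketch of frame sampling at two rates followed by lateral concatenation. This is not the authors' implementation: the function names, the clip length, the strides, the speed ratio `ALPHA`, and the choice of time-strided sampling to align the fast channel with the slow one are all assumptions made for illustration, and toy per-frame feature vectors stand in for the outputs of the 3D convolutional network.

```python
from typing import List, Sequence


def sample_indices(num_frames: int, stride: int) -> List[int]:
    """Return every `stride`-th frame index of a clip with `num_frames` frames."""
    if num_frames <= 0 or stride <= 0:
        raise ValueError("num_frames and stride must be positive")
    return list(range(0, num_frames, stride))


def lateral_concat(
    slow: Sequence[Sequence[float]],
    fast: Sequence[Sequence[float]],
    alpha: int,
) -> List[List[float]]:
    """Fuse two feature sequences along the channel axis.

    Every `alpha`-th fast time step is kept so that the fast sequence has
    the same temporal length as the slow one (an assumed alignment rule),
    then its channels are appended to the slow channels.
    """
    fast_aligned = fast[::alpha]
    if len(fast_aligned) != len(slow):
        raise ValueError("temporal lengths do not match after alignment")
    return [list(s) + list(f) for s, f in zip(slow, fast_aligned)]


# Hypothetical settings, not taken from the paper.
NUM_FRAMES = 32      # frames in the input clip
SLOW_STRIDE = 16     # low frame rate channel: spatial semantics
ALPHA = 8            # the fast channel sees ALPHA times more frames
FAST_STRIDE = SLOW_STRIDE // ALPHA  # high frame rate channel: motion

slow_idx = sample_indices(NUM_FRAMES, SLOW_STRIDE)
fast_idx = sample_indices(NUM_FRAMES, FAST_STRIDE)

# Toy features: the slow channel carries 4 values per time step, the fast
# channel only 1. Each value is simply the source frame index.
slow_feats = [[float(i)] * 4 for i in slow_idx]
fast_feats = [[float(i)] for i in fast_idx]

fused = lateral_concat(slow_feats, fast_feats, ALPHA)
print(slow_idx, len(fast_idx), fused)
```

In this sketch the slow channel keeps few frames but many feature channels, while the fast channel keeps many frames but few feature channels, so the fused sequence has one time step per slow frame and the channels of both. In a real network the fusion would operate on convolutional feature maps rather than lists.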