With the development of AI technology, human-computer interaction is no longer limited to the traditional mouse and keyboard. AI and VR have been widely used in early childhood education. As voice interaction, visual interaction, action interaction, and other technologies have gradually developed and been applied, multimodal interaction systems have become a research hotspot. In this paper, dynamic image capture and recognition technology is integrated into early childhood physical education to support intelligent interaction. Children’s physical behavior ability is judged by matching both the basic movement process and the final node of each movement in sports training, with attention paid to the accuracy and safety of the movements. The input images and questions come from the abstract clipart dataset for dynamic image recognition and from a self-made 3D dataset of Web3D dynamic motion scenes in the same style, whose content is similar to the actions taught in actual preschool training. Following the idea of process capture and target recognition, a new recognition model is developed from the original recognition model by means of Zheng’s target detector. The modified model is characterized by higher accuracy. Recognition needs to combine process recognition and result recognition. The experimental results show that the improved model offers high precision and fast speed, which provides a new research idea for the development of children’s physical training simulation.