In recent years, as computing technology has advanced, deep learning has been widely applied to computer vision tasks and has achieved considerable success in areas such as visual detection and tracking. The next objective is to make these techniques accessible in practical settings. Target detection and tracking in football gesture training is a challenging task with substantial practical and commercial value. Traditional football training methods often extract target trajectories from a recording chip carried by each player. However, this approach is costly and difficult to replicate in amateur stadiums. Some studies process targets in football videos using cameras alone. However, because targets in football videos look similar and frequently occlude one another, these methods often only segment targets such as players and balls in each image and do not track them over time. Target tracking is of great importance in football training and is the basis for tasks such as player training analysis and match strategy development. Many strong algorithms have emerged in the field of target tracking in recent years, mainly in the categories of correlation filtering and deep learning, but none of them achieves high accuracy in player tracking for football training videos. A related and pressing problem is locating the clips of interest to athletes within a full-length video. Traditional machine learning-based approaches to sports event detection have poor accuracy and are limited in the types of events they can detect. These methods often rely on auxiliary information such as audio commentary and related text, which is less stable than the video itself. Deep learning-based methods have recently made great progress in detecting events and actions in single-player video, but less progress in detecting events in sports video.
As a result, there are few sports video datasets that can be used for deep learning training. Based on research in computer vision and deep learning, this paper designs a multitarget tracking system for football training. Specifically, the system acquires images from multiple cameras in the stadium so that multiple targets can be tracked accurately over time. A single-camera multitarget tracking framework is designed on top of deep learning-based visual detection and correlation filter-based tracking. The framework uses data association algorithms to fuse the outputs of the detectors and trackers so that multiple targets can be tracked accurately within a single camera. In summary, this research enables robust, real-time, and accurate long-term tracking of targets in football training videos through multitarget tracking algorithms and mutual correction across the multiple-camera system.
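The single-camera fusion of detector and tracker outputs can be illustrated with a minimal sketch. The specific association algorithm is not stated at this point in the text, so the example assumes a simple greedy matching on intersection over union (IoU). The names `iou`, `associate`, `fuse`, and the threshold `min_iou` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks: Dict[int, Box], detections: List[Box],
              min_iou: float = 0.3):
    """Greedily pair tracker boxes with detector boxes by descending IoU.

    Assumption: greedy IoU matching stands in for the paper's association step.
    Returns (matches, unmatched_track_ids, unmatched_detection_indices).
    """
    pairs = sorted(
        ((iou(tb, db), tid, di)
         for tid, tb in tracks.items()
         for di, db in enumerate(detections)),
        key=lambda p: p[0],
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for score, tid, di in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same target
        if tid in matches or di in used:
            continue  # each track and each detection is used at most once
        matches[tid] = di
        used.add(di)
    unmatched_tracks = [tid for tid in tracks if tid not in matches]
    unmatched_dets = [di for di in range(len(detections)) if di not in used]
    return matches, unmatched_tracks, unmatched_dets


def fuse(tracks: Dict[int, Box], detections: List[Box],
         min_iou: float = 0.3) -> Dict[int, Box]:
    """Fuse one frame of tracker and detector output.

    Matched tracks take the detector box, unmatched tracks keep the
    tracker box, and unmatched detections start new tracks.
    """
    matches, _, new_dets = associate(tracks, detections, min_iou)
    fused = dict(tracks)
    for tid, di in matches.items():
        fused[tid] = detections[di]
    next_id = max(tracks, default=-1) + 1
    for di in new_dets:
        fused[next_id] = detections[di]
        next_id += 1
    return fused


tracks = {0: (0, 0, 10, 10), 1: (50, 50, 60, 60)}
detections = [(1, 1, 11, 11), (200, 200, 210, 210)]
fused = fuse(tracks, detections)
```

In this example, track 0 is corrected by the overlapping detection, track 1 is carried forward by the tracker alone (as it would be during an occlusion), and the detection with no matching track opens a new track. A globally optimal assignment, for example the Hungarian algorithm, could replace the greedy loop without changing the interface.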