Moving target detection and tracking is a core technology in computer vision. It integrates image processing, pattern recognition, artificial intelligence, and automatic control, and it is key to intelligent video surveillance systems. Such a system acquires video signals through visible-light or infrared sensors, applies digital image processing to the frames, detects moving targets, and extracts them for target recognition. The targets are then predicted and tracked according to their image features and spatiotemporal features, yielding the contour shape, position, and motion trajectory of each target as data support for subsequent tasks. This paper uses the convolutional neural network model HyperNet to study deep learning (DL) based target extraction from tennis video. The experimental results show that the training loss of this method remains stable between 1.5% and 2.5%, and that speed is also greatly improved. The number of candidate region boxes is significantly reduced, the computation time per frame does not exceed 1.8 s, the orientation accuracy of target extraction is 96.32%, and the size accuracy is 91.05%.
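
The generic detect-then-track pipeline described above can be illustrated with a minimal sketch. This is not the HyperNet-based method studied in this paper; it substitutes classical frame differencing as a stand-in detector, and every name in it (`detect_motion`, `box_center`, `make_frame`, `track`, the threshold of 25) is a hypothetical choice made only for illustration. It uses small synthetic grayscale frames rather than real tennis video.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]          # grayscale frame: rows of intensities in 0..255
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def detect_motion(prev: Frame, curr: Frame, threshold: int = 25) -> Optional[Box]:
    """Frame differencing: bounding box of pixels that changed by more than
    `threshold` between two frames, or None if nothing changed.
    (Illustrative stand-in detector, not the paper's HyperNet model.)"""
    xs, ys = [], []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            if abs(c - p) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def box_center(box: Box) -> Tuple[float, float]:
    """Center of a bounding box, used here as the target position estimate."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def make_frame(width: int, height: int, ball_x: int, ball_y: int) -> Frame:
    """Synthetic frame: dark background with a bright 2x2 'ball' whose
    top-left corner is at (ball_x, ball_y)."""
    frame = [[0] * width for _ in range(height)]
    for dy in range(2):
        for dx in range(2):
            frame[ball_y + dy][ball_x + dx] = 255
    return frame


def track(frames: List[Frame]) -> List[Tuple[float, float]]:
    """Motion trajectory: one position per consecutive frame pair in which
    motion was detected. Note that the changed region spans both the old
    and the new ball location, so each point lies between the two."""
    trajectory = []
    for prev, curr in zip(frames, frames[1:]):
        box = detect_motion(prev, curr)
        if box is not None:
            trajectory.append(box_center(box))
    return trajectory
```

Calling `track` on a list of frames produced by `make_frame` with a shifting ball position returns a list of points that moves in the same direction as the ball. A learned detector replaces `detect_motion` with a network that proposes and scores candidate regions, which is where the number of candidate boxes and the per-frame computation time become the relevant costs.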