With the rapid development of Internet technology, breakthroughs have been made in every branch of computer vision; in image detection and target tracking in particular, deep learning techniques such as convolutional neural networks have achieved excellent results. To explore the applicability of machine learning to video text recognition and extraction, a YOLOv3 network based on multiscale feature transformation and transfer fusion is proposed to improve the accuracy of English text detection in natural-scene video. First, to address multiscale target detection in video key frames, the scale-transfer module of the STDN algorithm is applied on top of the YOLOv3 network to shrink low-level feature maps, and a backbone network with feature reuse is constructed for feature extraction. Then, the scale-transfer module is used to enlarge high-level feature maps, and a feature pyramid network (FPN) is built for target prediction. Finally, the improved YOLOv3 network is evaluated on extracting key text from images. The experimental results show that the improved YOLOv3 network effectively reduces the false detections and missed detections caused by occlusion and small targets, and markedly improves the accuracy of English text extraction.
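As a rough illustration of the scale-transfer idea borrowed from STDN, the sketch below shows the two parameter-free operations the abstract refers to: shrinking a low-level feature map with space-to-depth so it can be reused in the backbone, and enlarging a high-level feature map with depth-to-space before fusion in the FPN-style prediction head. This is a minimal sketch, not the authors' code; the PyTorch implementation and all tensor shapes are assumptions for illustration only.

import torch
import torch.nn.functional as F


def shrink_low_level(feat: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Space-to-depth: (N, C, H, W) -> (N, C*r*r, H/r, W/r).

    Used here to reduce a low-level feature map so it can be reused
    (concatenated/fused) with deeper backbone features.
    """
    return F.pixel_unshuffle(feat, downscale_factor=r)


def enlarge_high_level(feat: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Depth-to-space: (N, C, H, W) -> (N, C//(r*r), H*r, W*r).

    Used here to enlarge a high-level feature map before building the
    FPN-style multiscale prediction branches.
    """
    return F.pixel_shuffle(feat, upscale_factor=r)


if __name__ == "__main__":
    low = torch.randn(1, 128, 52, 52)     # hypothetical low-level map
    high = torch.randn(1, 512, 13, 13)    # hypothetical high-level map
    print(shrink_low_level(low).shape)    # torch.Size([1, 512, 26, 26])
    print(enlarge_high_level(high).shape) # torch.Size([1, 128, 26, 26])

Because both operations only rearrange pixels between the channel and spatial dimensions, they add no learnable parameters, which is one reason the scale-transfer module is attractive for augmenting YOLOv3's multiscale feature fusion.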