Humans acquire much of the information they reason about through their eyes. Computer vision extends this ability to machines, giving computers a form of “vision” with which to perceive and convey information. The field has developed over many years and overlaps with other research directions, such as artificial intelligence, which has become popular in recent years, and pattern recognition, which is already widely applied. To lay out the structure and content of multitarget recognition clearly, this paper starts from the perspective of shallow vision, combines theory with practical experiments, and selects the most influential core technologies from the large body of available computer techniques, so that recognition algorithms can be compared with one another. The research shows that (1) among the many models considered, the CNN stands out for its feature-extraction ability and detection accuracy, reducing the error rate from 28.07% to 18.40%. (2) Candidate-region methods are complex, and their computational cost grows with the size of the region; regression-based methods far exceed them in both precision and speed and are better suited to this study. (3) A higher mAP comes at the cost of lower speed, and for the same model a higher image resolution yields a higher mAP (the SSD and YOLO models are commonly used). The experiments show a clear recognition effect. The article closes by summarizing the advantages and disadvantages of this study. Computer vision still calls for more in-depth research, and follow-up work can optimize multitarget recognition and detection to further improve accuracy.