In recent years, many visual positioning algorithms based on computer vision have been proposed and have achieved good results. However, these algorithms serve a single function, cannot perceive the environment, generalize poorly, and produce mismatches that degrade positioning accuracy. This paper therefore proposes a positioning algorithm that combines a target recognition algorithm with a depth feature matching algorithm to address unmanned aerial vehicle (UAV) environment perception and multi-modal image-matching fusion positioning. The algorithm builds on the single-shot object detector based on a multi-level feature pyramid network (M2Det) and replaces the original visual geometry group (VGG) feature extraction network with ResNet-101 to improve the feature extraction capability of the network model. By introducing a depth feature matching algorithm that shares the neural network weights, it realizes UAV target recognition and multi-modal image-matching fusion positioning in a single design. When mismatches occur between the reference image and the real-time image, the dynamic adaptive proportional constraint and random sample consensus (DAPC-RANSAC) algorithm is used to optimize the matching results and improve the efficiency of correct target matching. The proposed algorithm is evaluated through comparative analysis on a multi-modal registration data set to verify its superiority and feasibility. The results show that the proposed algorithm can effectively match multi-modal images (visible-infrared, infrared-satellite, and visible-satellite image pairs) and is stable and robust to changes in contrast, scale, brightness, blur, deformation, and other factors. Finally, the effectiveness and practicability of the proposed algorithm are verified in an aerial test scene with an S1000 six-rotor UAV.
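The mismatch-rejection step can be illustrated with a minimal sketch. The abstract does not specify how DAPC-RANSAC adapts its proportional constraint, so the code below is not the authors' method: it assumes a fixed nearest-to-second-nearest distance ratio and a translation-only RANSAC model in place of a full geometric transform. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random


def ratio_test(desc_a, desc_b, ratio=0.8):
    """Keep match i -> j only if the nearest descriptor in desc_b is clearly
    closer than the second nearest (fixed ratio; an assumption, not DAPC)."""
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches


def ransac_translation(pts_a, pts_b, matches, tol=2.0, iters=100, seed=0):
    """Return the largest set of matches that agree on one 2D translation.
    A translation-only model is a simplification chosen for brevity."""
    if not matches:
        return []
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        # Hypothesize a translation from one randomly chosen match.
        i, j = rng.choice(matches)
        dx = pts_b[j][0] - pts_a[i][0]
        dy = pts_b[j][1] - pts_a[i][1]
        # Count the matches consistent with that translation.
        inliers = [
            (p, q)
            for p, q in matches
            if math.hypot(pts_b[q][0] - pts_a[p][0] - dx,
                          pts_b[q][1] - pts_a[p][1] - dy) <= tol
        ]
        if len(inliers) > len(best):
            best = inliers
    return best


# Toy data: five distinctive descriptors plus one ambiguous descriptor.
desc_b = [tuple(1.0 if k == j else 0.0 for k in range(5)) for j in range(5)]
desc_a = desc_b + [(0.5, 0.5, 0.0, 0.0, 0.0)]  # equally close to two candidates

# Reference keypoints; the first four move by (3, 4), the fifth is a mismatch.
pts_a = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5), (7, 7)]
pts_b = [(3, 4), (13, 4), (3, 14), (13, 14), (50, 50)]

matches = ratio_test(desc_a, desc_b)                # drops the ambiguous descriptor
inliers = ransac_translation(pts_a, pts_b, matches)  # drops the inconsistent match
```

In this toy case the ratio test removes the descriptor that is equally close to two candidates, and the consensus step removes the match whose displacement disagrees with the others. An adaptive scheme would presumably vary `ratio` with the image pair rather than fix it at 0.8.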