In this paper, a visual navigation method based on binocular vision and a deep learning approach is proposed to solve the navigation problem of the unmanned aerial vehicle autonomous aerial refueling docking process. First, to meet the requirements of high accuracy and high frame rate in aerial refueling tasks, this paper proposes a single-stage lightweight drogue detection model, which greatly increases the inference speed of binocular images by introducing image alignment and depth-separable convolution and improves the feature extraction capability and scale adaptation performance of the model by using an efficient attention mechanism (ECA) and adaptive spatial feature fusion method (ASFF). Second, this paper proposes a novel method for estimating the pose of the drogue by spatial geometric modeling using optical markers, and further improves the accuracy and robustness of the algorithm by using visual reprojection. Moreover, this paper constructs a visual navigation vision simulation and semi-physical simulation experiments for the autonomous aerial refueling task, and the experimental results show the following: (1) the proposed drogue detection model has high accuracy and real-time performance, with a mean average precision (mAP) of 98.23% and a detection speed of 41.11 FPS in the embedded module; (2) the position estimation error of the proposed visual navigation algorithm is less than ±0.1 m, and the attitude estimation error of the pitch and yaw angle is less than ±0.5°; and (3) through comparison experiments with the existing advanced methods, the positioning accuracy of this method is improved by 1.18% compared with the current advanced methods.