A vision-based recognition and localization system plays a crucial role in the unmanned harvesting of aquatic vegetables. Field investigation shows that illumination, shading, and computational cost are the main difficulties limiting the identification and positioning of Brasenia schreberi. This paper therefore proposes a new lightweight detection method, YOLO-GS, which integrates feature information from both RGB and depth images for recognition and localization. YOLO-GS replaces traditional convolution with the Ghost convolution module and introduces C3-GS, a cross-stage module, to reduce the parameter count and computational cost. A redesigned detection head strengthens feature extraction in complex environments. In addition, the model uses Focal EIoU as the regression loss function to mitigate the adverse effect of low-quality samples on the gradients. We built a Brasenia schreberi dataset of 1500 images covering a variety of complex scenarios. Trained on this dataset, YOLO-GS achieves an average accuracy of 95.7%, with a model size of 7.95 MB, 3.75 M parameters, and a computational cost of 9.5 GFLOPs. Compared with the original YOLOv5s model, YOLO-GS improves recognition accuracy by 2.8% and reduces model size, parameter count, and computational cost by 43.6%, 46.5%, and 39.9%, respectively. Furthermore, the positioning errors of the picking points are less than 5.01 mm in the X direction, 3.65 mm in the Y direction, and 1.79 mm in the Z direction. YOLO-GS thus combines high recognition accuracy with low computational demand, enabling precise target identification and localization in complex environments and meeting the requirements of real-time harvesting tasks.
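To make the role of the Focal EIoU regression loss more concrete, the following is a minimal sketch for a single pair of boxes. It assumes the commonly published Focal-EIoU formulation (EIoU = 1 - IoU + center-distance term + width term + height term, weighted by IoU raised to a power gamma). The function name `focal_eiou_loss`, the `(x1, y1, x2, y2)` box format, and the value `gamma=0.5` are illustrative assumptions; the abstract does not state the settings actually used in YOLO-GS.

```python
def focal_eiou_loss(pred, target, gamma=0.5, eps=1e-9):
    """Focal-EIoU loss for one box pair (hypothetical helper, not the authors' code).

    Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1 (assumed format).
    gamma=0.5 is an assumed default, not a value taken from the abstract.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union of the two boxes.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Squared distance between box centers, normalized by the enclosing diagonal.
    dx = (px1 + px2) / 2 - (tx1 + tx2) / 2
    dy = (py1 + py2) / 2 - (ty1 + ty2) / 2
    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Width and height mismatches, normalized by the enclosing box sides.
    width_term = (pw - tw) ** 2 / (cw * cw + eps)
    height_term = (ph - th) ** 2 / (ch * ch + eps)

    eiou = 1.0 - iou + center_term + width_term + height_term

    # The IoU**gamma weight shrinks the contribution of poorly overlapping
    # (low-quality) boxes, so they influence the gradient less.
    return (iou ** gamma) * eiou


if __name__ == "__main__":
    print(focal_eiou_loss((0, 0, 10, 10), (5, 5, 15, 15)))
```

In this sketch, a prediction with low overlap receives a small weight and a well-aligned prediction receives a weight close to one, which is one way to read the abstract's statement that the loss mitigates the effect of low-quality samples on the gradients.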