For the automated robotic picking of bunch-type fruit, the strategy is to roughly determine the location of the bunches, plan the picking route from a remote location, and then locate the picking point precisely at a more appropriate, closer location. The latter can reduce the amount of information to be processed and obtain more precise and detailed features, thus improving the accuracy of the vision system. In this study, a long–close distance coordination control strategy for a litchi picking robot was proposed based on an Intel Realsense D435i camera combined with a point cloud map collected by the camera. The YOLOv5 object detection network and DBSCAN point cloud clustering method were used to determine the location of bunch fruits at a long distance to then deduce the sequence of picking. After reaching the close-distance position, the Mask RCNN instance segmentation method was used to segment the more distinctive bifurcate stems in the field of view. By processing segmentation masks, a dual reference model of “Point + Line” was proposed, which guided picking by the robotic arm. Compared with existing studies, this strategy took into account the advantages and disadvantages of depth cameras. By experimenting with the complete process, the density-clustering approach in long distance was able to classify different bunches at a closer distance, while a success rate of 88.46% was achieved during fruit-bearing branch locating. This was an exploratory work that provided a theoretical and technical reference for future research on fruit-picking robots.