Grapes are an important cash crop that contributes to the rapid development of the agricultural economy, and harvesting ripe fruit is a crucial step in grape production. At present, however, picking is done mainly by hand, which is time-consuming and costly. Intelligent grape picking is therefore particularly important, and accurate detection of grape stems is a key step toward achieving it. In this study, we propose YOLOv8n-GP, a trellis grape stem detection model that combines the SENetV2 attention module and the CARAFE upsampling operator with YOLOv8n-pose. Specifically, we first embed the SENetV2 attention module at the bottom of the backbone network to enhance the model’s ability to extract key feature information. We then replace the upsampling modules in the neck network with the CARAFE upsampling operator, expanding the receptive field of the model without increasing its parameters. Finally, to validate the detection performance of YOLOv8n-GP, we compare it with keypoint detection models built on YOLOv8n-pose, YOLOv5-pose, YOLOv7-pose, and YOLOv7-Tiny-pose. Experimental results show that the precision, recall, mAP, and mAP-kp of YOLOv8n-GP reach 91.6%, 91.3%, 97.1%, and 95.4%, respectively, which are improvements of 3.7%, 3.6%, 4.6%, and 4.0% over YOLOv8n-pose. Furthermore, YOLOv8n-GP outperforms the other keypoint detection models on every evaluation metric. These results demonstrate that YOLOv8n-GP can detect trellis grape stems efficiently and accurately, providing technical support for advancing intelligent grape harvesting.
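To illustrate the kind of channel attention that the SENetV2 module builds on, the following is a minimal sketch of a classic squeeze-and-excitation (SE) gate in plain Python. It is not the SENetV2 module used in YOLOv8n-GP and not the authors' implementation: the function names (`se_channel_attention`, `dense`), the toy feature map, and the weights are all hypothetical, chosen only to show how per-channel gates reweight a feature map.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(vec, weights, bias):
    # Fully connected layer: one row of `weights` per output unit.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def se_channel_attention(fmap, w1, b1, w2, b2):
    """Classic SE gate on a feature map given as nested lists [C][H][W].

    Hypothetical helper for illustration; returns (reweighted map, gates).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in fmap]
    # Excitation: bottleneck layer + ReLU, then a layer + sigmoid.
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]
    gates = [sigmoid(g) for g in dense(hidden, w2, b2)]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch]
              for ch, g in zip(fmap, gates)]
    return scaled, gates


# Toy input (assumed values): 2 channels, each 2x2.
fmap = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0, mean 1.0
    [[0.0, 2.0], [4.0, 2.0]],  # channel 1, mean 2.0
]
w1, b1 = [[-1.0, 1.0]], [0.0]             # 2 channels -> 1 hidden unit
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]      # 1 hidden unit -> 2 gates

scaled, gates = se_channel_attention(fmap, w1, b1, w2, b2)
print([round(g, 3) for g in gates])
```

With these toy weights, channel 0 receives a larger gate than channel 1, so its activations are preserved more strongly. In a trained network the weights are learned, so the gates emphasize whichever channels carry the most useful features.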