To support automated tea picking and improve the speed and accuracy of tea leaf grading, this study proposes an improved YOLOv8 network for grading recognition of fresh tea leaves. A Hierarchical Vision Transformer using Shifted Windows (Swin Transformer) replaces parts of the original YOLOv8 architecture, reducing the computational cost of dense image processing. An Efficient Multi-Scale Attention Module with Cross-Spatial Learning (EMA) suppresses irrelevant features in complex backgrounds, which raises the model's detection Precision. In addition, replacing the loss function with SIoU speeds up model convergence and yields more precise target localization. Experimental results show that, compared with the original YOLOv8 model, the improved YOLOv8 algorithm increases Precision, Recall, F1, and mAP by 3.39%, 0.86%, 2.20%, and 2.81%, respectively. In external validation, the improved model's FPS exceeds that of the original YOLOv8, YOLOv5, YOLOX, Faster RCNN, and SSD deep-learning models by 6.75 Hz, 10.84 Hz, 12.79 Hz, 28.24 Hz, and 21.57 Hz, respectively, and its mAP in practical detection is higher by 2.79%, 2.92%, 3.08%, 7.07%, and 3.84%, respectively. The improved model thus combines accurate tea-grading recognition with fast detection, laying a foundation for the development of tea-picking robots and tea quality grading devices.
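For readers less familiar with the reported metrics, the following minimal sketch shows how Precision, Recall, and F1 are conventionally derived from detection counts. It is an illustration under stated assumptions, not the evaluation code used in this study: the function name `detection_metrics` and the counts passed to it are hypothetical, and how a detection is matched to a ground-truth box (for example, by an IoU threshold) is assumed to have been decided beforehand.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute Precision, Recall, and F1 from detection counts.

    tp: detections matched to a ground-truth box (true positives)
    fp: detections with no matching ground-truth box (false positives)
    fn: ground-truth boxes that were not detected (false negatives)
    """
    # Guard against division by zero when there are no detections or no targets.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of Precision and Recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only for illustration (not results from the study).
metrics = detection_metrics(tp=90, fp=10, fn=30)
print(metrics)
```

mAP extends this idea by averaging, over all grades, the area under each grade's Precision-Recall curve; that step is omitted here for brevity.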