The current smart cooler's commodity identification system first locates the item being purchased and then performs feature extraction and matching. However, this method often suffers from inaccuracies caused by background regions inside the detection frame, leading to missed detections and misidentifications. To address these issues, we propose an end-to-end You Only Look Once (YOLO)-based detection and segmentation algorithm. In the backbone network, we combine deformable convolution with the channel-to-pixel (C2f) module to enhance the model's feature-extraction capability. In the neck network, we introduce an optimized feature fusion structure based on the weighted bi-directional feature pyramid. To further enhance the model's understanding of both global and local context, a triple feature encoding module is employed to seamlessly fuse multi-scale features for improved performance. A convolutional block attention module is attached to the improved C2f module to strengthen the network's attention to channel and spatial information in commodity images. A supplementary segmentation branch is incorporated into the head of the network, allowing it not only to detect targets within the image but also to generate precise segmentation masks for each detected object, thereby enhancing its multi-task capability. Compared with YOLOv8, box and mask precision increase by 3% and 4.7%, recall by 2.8% and 4.7%, and mean average precision (mAP) by 4.9% and 14%, respectively. The model runs at 119 frames per second, which meets the demand for real-time detection. The results of comparative and ablation studies confirm the high accuracy and performance of the proposed algorithm, providing a solid foundation for fine-grained commodity identification.
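As a rough illustration of how the backbone modification described above might be realized, the following is a minimal PyTorch sketch of a C2f-style block whose bottlenecks use deformable convolution and whose output passes through a convolutional block attention module. The class names (`DeformBottleneck`, `CBAM`, `DeformC2fCBAM`), channel counts, and structural details are assumptions made for illustration; the paper's exact implementation is not specified in this abstract.

```python
# Sketch only: a C2f-style block with deformable-convolution bottlenecks and CBAM.
# All names, channel counts, and layer arrangements are illustrative assumptions,
# not the authors' published implementation.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformBottleneck(nn.Module):
    """Residual bottleneck whose 3x3 convolution is replaced by a deformable convolution."""
    def __init__(self, channels):
        super().__init__()
        # A plain 3x3 conv predicts (x, y) offsets for each of the 9 sampling points.
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)
        self.dconv = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.bn(self.dconv(x, self.offset(x))))


class CBAM(nn.Module):
    """Convolutional block attention: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Channel attention from average- and max-pooled descriptors.
        ca = torch.sigmoid(
            self.mlp(x.mean((2, 3), keepdim=True)) + self.mlp(x.amax((2, 3), keepdim=True))
        )
        x = x * ca
        # Spatial attention from channel-wise average and max maps.
        sa = torch.sigmoid(
            self.spatial(torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1))
        )
        return x * sa


class DeformC2fCBAM(nn.Module):
    """C2f-style split/concat block with deformable bottlenecks and CBAM on the output."""
    def __init__(self, in_ch, out_ch, n=2):
        super().__init__()
        hidden = out_ch // 2
        self.cv1 = nn.Conv2d(in_ch, 2 * hidden, 1)
        self.blocks = nn.ModuleList(DeformBottleneck(hidden) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * hidden, out_ch, 1)
        self.cbam = CBAM(out_ch)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for block in self.blocks:
            y.append(block(y[-1]))
        return self.cbam(self.cv2(torch.cat(y, dim=1)))


if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)
    print(DeformC2fCBAM(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```

In this sketch the deformable convolution lets the sampling grid adapt to the irregular outlines of commodity items, while CBAM reweights channels and spatial locations before the features are passed to the neck, mirroring the roles the abstract assigns to these modules.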