The detection of tomatoes is of vital importance for enhancing production efficiency, with image recognition-based tomato detection methods being the primary approach. However, these methods face challenges such as the difficulty in extracting small targets, low detection accuracy, and slow processing speeds. Therefore, this paper proposes an improved RT-DETR-Tomato model for efficient tomato detection under complex environmental conditions. The model mainly consists of a Swin Transformer block, a BiFormer module, path merging, multi-scale convolutional layers, and fully connected layers. In this proposed model, Swin Transformer is chosen as the new backbone network to replace ResNet50 because of its superior ability to capture broader global dependency relationships and contextual information. Meanwhile, a lightweight BiFormer block is adopted in Swin Transformer to reduce computational complexity through content-aware flexible computation allocation. Experimental results show that the average accuracy of the final RT-DETR-Tomato model is greatly improved compared to the original model, and the model training time is greatly reduced, demonstrating better environmental adaptability. In the future, the RT-DETR-Tomato model can be integrated with intelligent patrol and picking robots, enabling precise identification of crops and ensuring the safety of crops and the smooth progress of agricultural production.