Recently, Transformer-based end-to-end detectors (DETRs) have made remarkable progress. However, their high computational cost still limits the performance of the DETR series as real-time object detectors. To address this problem, we introduce Van-DETR, which improves on RT-DETR, the first real-time end-to-end object detector. First, we replace the original ResNet backbone with the more lightweight VanillaNet. To compensate for VanillaNet's weak nonlinearity and poor local analysis, we combine large-kernel and small-kernel convolutions to integrate global and local information, significantly enhancing feature extraction. Second, in the hybrid encoder, we apply cascaded group processing to the features extracted by the backbone and design a gated linear unit with a star-shaped connection for intra-scale feature interaction. For cross-scale feature fusion, we propose a high-low frequency feature fusion module with strong feature representation capability. To verify the effectiveness of the model, we conduct experiments on two public object detection datasets: the VisDrone dataset and a people dataset from Roboflow. Experimental results show that Van-DETR achieves mAP50 of 0.471 and 0.730 on the two datasets, respectively, improvements of 4.5% and 2.8% over the original RT-DETR model. Source code is available at https://github.com/vangoghzz/Van-DETR.
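As a rough illustration of the gating idea mentioned above, the following minimal sketch assumes that the "star-shaped connection" denotes an element-wise product between two linearly projected branches, one of which is passed through a sigmoid gate. The abstract does not confirm this reading. The function names (`linear`, `star_glu`), the choice of sigmoid, and the toy weights are all hypothetical and are not taken from the Van-DETR source code.

```python
import math


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector stored as a plain list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(z):
    """Logistic function used here as the (assumed) gate activation."""
    return 1.0 / (1.0 + math.exp(-z))


def star_glu(x, w_value, b_value, w_gate, b_gate):
    """Hypothetical gated unit: value branch multiplied element-wise by a gate.

    The element-wise product is the assumed "star" operation; the real
    module in Van-DETR may differ.
    """
    value = linear(x, w_value, b_value)
    gate = linear(x, w_gate, b_gate)
    return [v * sigmoid(g) for v, g in zip(value, gate)]


# Toy input: identity value projection and an all-zero gate projection,
# so every gate value is sigmoid(0) = 0.5.
x = [1.0, 2.0]
w_value = [[1.0, 0.0], [0.0, 1.0]]
b_value = [0.0, 0.0]
w_gate = [[0.0, 0.0], [0.0, 0.0]]
b_gate = [0.0, 0.0]

out = star_glu(x, w_value, b_value, w_gate, b_gate)
print(out)  # [0.5, 1.0]
```

With a zero gate projection, each feature is simply halved, which makes the multiplicative interaction easy to check by hand. In a real encoder the two branches would be learned projections over feature maps rather than fixed toy matrices.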