Highways are a vital component of any country's infrastructure. However, poorly maintained highways in Indonesia endanger users and compromise road safety. Detecting cracks early in the deterioration process can prevent further damage and lower maintenance costs. A recent study sought to develop a method for detecting road damage by combining the road damage detection (RDD) dataset with generative adversarial network technology and data augmentation to improve training. The current study aims to extend the you only look once (YOLO) framework by incorporating the Swin Transformer into the cross stage partial (CSP) component of YOLOv7, with the goal of improving object detection accuracy in a variety of visual scenarios. The study compares the performance of several object detection models with varying parameters and configurations: YOLOv5l, YOLOv6l, YOLOv7-tiny, YOLOv7, and YOLOv7x. YOLOv5l has 46 million parameters and 108 billion floating point operations (FLOPs), whereas YOLOv6l has 59.5 million parameters and 150 billion FLOPs. With 31 million parameters and 140 billion FLOPs, the proposed YOLOv7-swin model performs best, achieving a mean average precision (mAP) of 0.47 at mAP_0.5 and 0.232 at mAP_0.5:0.95. The experimental results show that our YOLOv7-swin model outperforms both YOLOv7x and YOLOv7-tiny. The proposed model significantly improves object detection accuracy while balancing complexity and performance.
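To make the architectural idea concrete, the sketch below illustrates what embedding a Swin-style block inside a CSP module could look like. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the class names (WindowAttention, CSPSwinBlock) and all hyperparameters (window size, head count) are illustrative assumptions. It shows the two defining ingredients, window-based multi-head self-attention from the Swin Transformer, and the CSP split-transform-merge structure.

```python
# Hypothetical sketch: a Swin-style window-attention block placed on one branch
# of a CSP-style split. Names and hyperparameters are illustrative, not taken
# from the paper's code.
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Multi-head self-attention computed within non-overlapping windows,
    the core operation of the Swin Transformer."""

    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        ws = self.window_size
        # Partition the feature map into ws x ws windows.
        x = x.view(B, C, H // ws, ws, W // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)
        # Self-attention inside each window, with a residual connection.
        n = self.norm(x)
        x = x + self.attn(n, n, n)[0]
        # Reverse the window partition back to (B, C, H, W).
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)


class CSPSwinBlock(nn.Module):
    """CSP-style block: the input is split into two branches; one passes
    through the window-attention block while the other acts as a shortcut,
    and the two are concatenated and fused by a 1x1 convolution."""

    def __init__(self, channels, window_size=7, num_heads=4):
        super().__init__()
        half = channels // 2
        self.split_a = nn.Conv2d(channels, half, 1)   # transformer branch
        self.split_b = nn.Conv2d(channels, half, 1)   # shortcut branch
        self.swin = WindowAttention(half, window_size, num_heads)
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        a = self.swin(self.split_a(x))
        b = self.split_b(x)
        return self.fuse(torch.cat([a, b], dim=1))


if __name__ == "__main__":
    # Spatial dimensions must be divisible by the window size in this sketch.
    feat = torch.randn(1, 64, 28, 28)
    print(CSPSwinBlock(64)(feat).shape)  # torch.Size([1, 64, 28, 28])
```

In this layout, the attention branch injects the Swin Transformer's long-range, window-local modeling into the backbone while the CSP shortcut branch preserves the gradient path and keeps the parameter count down, which is consistent with the abstract's reported trade-off of fewer parameters than YOLOv5l or YOLOv6l at comparable FLOPs.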