Concealed object detection in millimeter wave (MMW) images has attracted significant attention in public safety because MMW imaging is non-hazardous and contactless. In practice, however, the task is difficult: imaging resolution is low, concealed objects are small, environmental noise is complex, and detection must run in real time. In this study, we propose Swin-YOLO, a single-stage detection model built on transformer layers. Our approach makes three key contributions. First, we integrate Local Perception Swin Transformer Layers (LPST Layers), which enhance the network’s ability to capture contextual information and its local awareness. Second, we introduce a new feature fusion layer and a dedicated prediction head for small targets, which together exploit the network’s shallow feature information. Third, we insert a coordinate attention (CA) module between the neck network and the detection head, increasing the network’s sensitivity to critical regions of small objects. To validate the method, we created a new MMW dataset containing a large number of small concealed objects and conducted comprehensive experiments evaluating the overall model, the individual improvements, and computational efficiency. Swin-YOLO improved mean Average Precision (mAP) by 4.7% over the YOLOv5 baseline. Moreover, compared with other enhanced transformer-based models, Swin-YOLO achieved higher accuracy and the fastest inference speed. These results indicate that the proposed model is a promising option for real-world public safety applications.