Existing target detection algorithms perform poorly at detecting illegal buildings in the ancient city. To address this problem, a YOLO-UB-based algorithm for detecting and recognizing illegal buildings in the ancient city is proposed, using remote sensing images captured by UAV. The algorithm incorporates the coordinate attention (CA) mechanism to improve network robustness, enhance the model’s ability to detect illegal building targets, and make target localization more accurate. It also introduces the Swin Transformer V2 structure, whose self-attention mechanism mines target features in depth, strengthening global information capture and enabling the network to better integrate multi-scale features. The algorithm is trained iteratively on a custom dataset and compared with other models. The results show that it achieves a mean average precision (mAP) of 96.8% in detecting illegal building targets, an improvement of 3.1% over YOLOv7, and that it offers better feature extraction, robustness, and generalization than the other target detection models.
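
For readers unfamiliar with coordinate attention, the following is a minimal NumPy sketch of the general CA idea: pool the feature map along each spatial axis, pass the pooled descriptors through a shared channel reduction, and gate the input with per-row and per-column weights. It is not the YOLO-UB implementation. The function name, the weight matrices `w_reduce`, `w_h`, `w_w`, the reduction ratio, and the omission of batch normalization are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def hard_swish(z):
    # Non-linearity commonly paired with coordinate attention (assumed here).
    return z * np.clip(z + 3.0, 0.0, 6.0) / 6.0


def coordinate_attention(x, w_reduce, w_h, w_w):
    """Illustrative coordinate attention on one feature map x of shape (C, H, W).

    w_reduce: (C_r, C) shared 1x1 convolution that reduces channels.
    w_h, w_w: (C, C_r) 1x1 convolutions that restore channels per direction.
    All three weight matrices are hypothetical placeholders, not trained values.
    """
    _, h, _ = x.shape
    pooled_h = x.mean(axis=2)  # average over width  -> (C, H)
    pooled_w = x.mean(axis=1)  # average over height -> (C, W)

    # Concatenate both directional descriptors and apply the shared reduction.
    y = np.concatenate([pooled_h, pooled_w], axis=1)  # (C, H + W)
    y = hard_swish(w_reduce @ y)                      # (C_r, H + W)

    # Split back into the two directions and compute gates in (0, 1).
    f_h, f_w = y[:, :h], y[:, h:]
    g_h = sigmoid(w_h @ f_h)  # (C, H)
    g_w = sigmoid(w_w @ f_w)  # (C, W)

    # Broadcast the row and column gates over the input feature map.
    return x * g_h[:, :, None] * g_w[:, None, :]


rng = np.random.default_rng(0)
C, H, W, R = 8, 5, 7, 4  # toy sizes; R is an assumed reduction ratio
x = rng.standard_normal((C, H, W))
w_reduce = rng.standard_normal((C // R, C)) * 0.1
w_h = rng.standard_normal((C, C // R)) * 0.1
w_w = rng.standard_normal((C, C // R)) * 0.1
out = coordinate_attention(x, w_reduce, w_h, w_w)
```

Because each gate depends on position along only one axis, the output keeps the input shape while encoding where along the height and width a response occurs, which is the property the abstract associates with more accurate target localization.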