Detecting component defects and tiny foreign objects attached to overhead transmission lines is critical to the safe operation of the national grid and to reliable power distribution. This urgent task is challenging, however, because of the complex working environment and the considerable workforce it requires. To address these challenges, we propose YOLO-CSM, a deep-learning-based object detection approach. By combining two attention mechanisms (Swin transformer and CBAM) with an extra detection layer, the proposed model effectively captures global information and key visual features, which improves its ability to identify tiny defects and distant objects in the field of view. To validate the model, this work assembles a dataset of public images and our own field-captured samples. The experiments show that YOLO-CSM is a suitable solution for small and distant object detection and that it outperforms several widely used algorithms, with a 16.3% faster detection speed than YOLOv5 and 3.3% better detection accuracy than YOLOv7. Finally, this work conducts an interpretability experiment that reveals the similarity between YOLO-CSM’s attention patterns and those of humans, which helps explain YOLO-CSM’s advantages in detecting small objects and minor defects in the working environments of power transmission lines.