Swin-Transformer-Enabled YOLOv5 with Attention Mechanism for Small Object Detection on Satellite Images

Gong, Hang; Mu, Tingkui; Li, Qiuxia; Dai, Haishan; Li, Chunlai; He, Zhike; Wang, Wenjing; Han, Feng; Tuniyazi, Abudusalamu; Li, Haoyang; Lang, Xuechan; Li, Zhiyuan; Wang, Bin

doi:10.3390/rs14122861

Cited by 138 publications

(68 citation statements)

References 36 publications


“…Figure 9(e) is a picture taken at a high altitude. The vehicle on the road is very small, but it can still improve the detection effect, indicating that the model can detect small objects [35].…”

Section: G Algorithm Validity Analysis (mentioning)

confidence: 99%

An Improved YOLOv5 Method for Small Object Detection in UAV Capture Scenes

et al. 2023


Aiming at the problem of a large number of small dense objects in high-altitude shooting and complex background noise interference in the captured scenes, an improved object detection algorithm for YOLOv5 UAV capture scenes is proposed. A Feature Enhancement Block (FEBlock) is first proposed to generate adaptive weights for different receptive field features by convolution, assigning major weights to shallow feature maps to improve small object feature extraction ability. The FEBlock is then integrated into Spatial Pyramid Pooling (SPP) to generate Enhanced Spatial Pyramid Pooling (ESPP), which performs feature enhancement for the result of each maximum pooling; and creates new features containing multiscale contextual information with better feature characterization capability by weighting fused contextual features. Secondly, the Self-Characteristic Expansion Plate (SCEP) is proposed, which achieves the fusion and expansion of feature information through compression, non-linear mapping, and expansion with its own module, further improving the network's capacity for feature extraction and generating a new spatial pyramid pooling (ESPP-S) by splicing with ESPP. Finally, a shallower feature map is added as a detection layer to the YOLOv5 network model's large, medium, and small detection layers to improve the network's detection performance for medium and long-range objects. Experiments were conducted on the VisDrone2021 dataset, and the results showed that the improved YOLOv5 model improved mAP0.5 by 4.6%, mAP0.5:0.95 by 2.9%, and precision by 2.7%. The mAP0.5 of the model trained at the input resolution of 1024 × 1024 reached 56.8%. The experiments show that the improved YOLOv5 model can improve object detection accuracy for UAV capture scenes.


“…These methods have achieved good results in natural image datasets such as MS COCO, PASCAL VOC, etc. [5]. However, when these methods are used in remote sensing images, the results are always hardly satisfactory.…”

Section: Introduction (mentioning)

confidence: 99%

Swin-Transformer-Based YOLOv5 for Small-Object Detection in Remote Sensing Images

Xuan, Zhang, Lang et al. 2023

Sensors


This study aimed to address the problems of low detection accuracy and inaccurate positioning of small-object detection in remote sensing images. An improved architecture based on the Swin Transformer and YOLOv5 is proposed. First, Complete-IOU (CIOU) was introduced to improve the K-means clustering algorithm, and then an anchor of appropriate size for the dataset was generated. Second, a modified CSPDarknet53 structure combined with Swin Transformer was proposed to retain sufficient global context information and extract more differentiated features through multi-head self-attention. Regarding the path-aggregation neck, a simple and efficient weighted bidirectional feature pyramid network was proposed for effective cross-scale feature fusion. In addition, extra prediction head and new feature fusion layers were added for small objects. Finally, Coordinate Attention (CA) was introduced to the YOLOv5 network to improve the accuracy of small-object features in remote sensing images. Moreover, the effectiveness of the proposed method was demonstrated by several kinds of experiments on the DOTA (Dataset for Object detection in Aerial images). The mean average precision on the DOTA dataset reached 74.7%. Compared with YOLOv5, the proposed method improved the mean average precision (mAP) by 8.9%, which can achieve a higher accuracy of small-object detection in remote sensing images.
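The abstract above mentions K-means anchor generation with a CIoU-based similarity but gives no implementation details. The following is only a minimal sketch of how such clustering could look, under stated assumptions: boxes are reduced to (width, height) pairs and treated as co-centred, so the centre-distance term of CIoU vanishes and only the IoU and aspect-ratio terms remain. The function names `ciou_wh` and `kmeans_anchors` are hypothetical and are not taken from the cited paper.

```python
import math
import random


def ciou_wh(box, anchor):
    """CIoU of two co-centred boxes given as (width, height).

    Assumption: with shared centres the centre-distance penalty is zero,
    so CIoU reduces to IoU - alpha * v (aspect-ratio penalty only).
    """
    w1, h1 = box
    w2, h2 = anchor
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / union
    v = (4 / math.pi ** 2) * (math.atan(w1 / h1) - math.atan(w2 / h2)) ** 2
    denom = (1 - iou) + v
    alpha = v / denom if denom > 0 else 0.0
    return iou - alpha * v


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchors using CIoU as similarity."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # initialise from k distinct boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the anchor it overlaps best with.
            best = max(range(k), key=lambda i: ciou_wh(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchors[i])  # keep an empty cluster's anchor
        if new_anchors == anchors:
            break  # assignments have stabilised
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])  # smallest area first


if __name__ == "__main__":
    # Toy (width, height) data: one group of small boxes, one of large boxes.
    toy_boxes = [(10, 10), (12, 12), (11, 11), (100, 100), (98, 102), (102, 98)]
    print(kmeans_anchors(toy_boxes, k=2))
```

Using an overlap-based similarity rather than Euclidean distance keeps large boxes from dominating the clustering, which is the usual motivation for IoU-style K-means on anchor sizes; whether the cited work updates centroids by the mean, as done here, is not stated in the abstract.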


“…The transformer model was first proposed in 2017 [49] in the field of natural language processing (NLP), and was extended to deal with a computer vision task in 2020 [50]. The vision transformer (ViT) models have also been introduced into the field of remote sensing image processing, for applications such as semantic segmentation [51] and object detection [52], and have achieved competitive results, compared with CNN models. Some researchers have also studied the performance of ViT models in remote sensing image landslide detection [53].…”

Section: Introduction (mentioning)

confidence: 99%

ShapeFormer: A Shape-Enhanced Vision Transformer Model for Optical Remote Sensing Image Landslide Detection

Li³

et al. 2023

IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing


Landslides pose a serious threat to human life, safety, and natural resources. Remote sensing images can be used to effectively monitor landslides at a large scale, which is of great significance for pre-disaster warning and post-disaster assistance. In recent years, deep learning based methods have made great progress in the field of remote sensing image landslide detection. In remote sensing images, landslides display a variety of scales and shapes. In this paper, to better extract and keep the multi-scale shape information of landslides, a shape-enhanced vision transformer (ShapeFormer) model is proposed. For the feature extraction, a pyramid vision transformer (PVT) model is introduced, which directly models the global information of local elements at different scales. To learn the shape information of different landslides, a shape feature extraction branch is designed, which uses the adjacent feature maps at different scales in the PVT model to improve the boundary information. After the feature extraction step, a decoder with deconvolutional layers follows, which combines the multiple features and gradually recovers the original resolution of the combined features. A softmax layer is connected with the combined features to acquire the final pixelwise result. The proposed ShapeFormer model was tested on two public datasets-the Bijie dataset and the Nepal dataset-which have different spectral and spatial characteristics. The results, when compared with those of some of the state-of-the-art methods, show the potential of the proposed method for use with multisource optical remote sensing data for landslide detection.
