Remote sensing image classification (RSIC) is a classical and fundamental task in the intelligent interpretation of remote sensing imagery, providing a unique label for each acquired image. Thanks to the potent global-context extraction ability of the multi-head self-attention (MSA) mechanism, vision transformer (ViT)-based architectures have shown excellent performance in natural scene image classification. However, capturing global spatial information alone is insufficient for strong RSIC performance. In particular, for fine-grained target recognition tasks with high inter-class similarity, discriminative and effective local feature representations are key to correct classification. In addition, owing to its lack of inductive biases, the powerful global spatial context representation capability of ViT requires lengthy training procedures and large-scale pre-training data. To address these problems, a hybrid architecture of convolutional neural network (CNN) and ViT, called P2FEViT, is proposed to improve RSIC by integrating plug-and-play CNN features into ViT. In this paper, the feature representation capabilities of CNN and ViT for RSIC are first analyzed. Second, to combine the advantages of CNN and ViT, a novel approach embedding CNN features into the ViT architecture is proposed, enabling the model to synchronously capture and fuse global context and local multimodal information and thereby further improve the classification capability of ViT. Third, based on the hybrid structure, only a simple cross-entropy loss is employed for model training, and the model converges rapidly with considerably less training data than the original ViT. Finally, extensive experiments are conducted on the public and challenging remote sensing scene classification dataset NWPU-RESISC45 (NWPU-R45) and on the self-built fine-grained target classification dataset BIT-AFGR50. The experimental results demonstrate that the proposed P2FEViT effectively improves feature description capability and obtains outstanding classification performance, while significantly reducing ViT's dependence on large-scale pre-training data and accelerating convergence. The code and the self-built dataset will be released on our webpages.
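As a minimal illustration of the plug-and-play idea sketched in this abstract, the PyTorch snippet below flattens CNN feature maps into tokens and concatenates them with the ViT patch tokens, so that self-attention fuses global context and local CNN features in one sequence. All layer sizes, the CNN branch design, and the token layout are illustrative assumptions, not the authors' exact P2FEViT configuration.

```python
import torch
import torch.nn as nn

class HybridCNNViT(nn.Module):
    """Sketch of a CNN-feature-embedded ViT (assumed configuration)."""
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6, num_classes=45):
        super().__init__()
        # Lightweight CNN branch producing a 14x14 feature map (illustrative).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=4, padding=1),
        )
        # Standard ViT patch embedding.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        # Positional embedding covers CLS + patch tokens + CNN tokens.
        self.pos = nn.Parameter(torch.zeros(1, 1 + 2 * n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        b = x.size(0)
        patch_tok = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cnn_tok = self.cnn(x).flatten(2).transpose(1, 2)            # (B, N, dim)
        tok = torch.cat([self.cls.expand(b, -1, -1), patch_tok, cnn_tok], dim=1)
        tok = self.encoder(tok + self.pos)
        return self.head(tok[:, 0])  # classify from the CLS token

logits = HybridCNNViT()(torch.randn(2, 3, 224, 224))  # -> (2, 45)
```

Training such a hybrid would use only a standard cross-entropy loss on the logits, matching the simple training objective described in the abstract.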
Automatic sea-land segmentation is an essential and challenging task for the practical use of panchromatic satellite imagery. Owing to temporal variations and the complex, inconsistent intensity contrast in both land and sea areas, it is difficult to generate accurate segmentation results with conventional thresholding methods. Additionally, freely available digital elevation models (DEMs) hardly meet the requirements of high-resolution data for practical usage, because of their low precision and the high memory storage costs they impose on processing systems. We therefore propose a fully automatic sea-land segmentation approach for practical use, built on a hierarchical coarse-to-fine procedure. We compared our method with other state-of-the-art methods on real images with complex backgrounds and conducted quantitative comparisons. The experimental results show that our method outperforms all other methods and is computationally efficient.
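To make the coarse-to-fine idea concrete, the NumPy/OpenCV sketch below estimates a coarse sea/land mask on a downsampled image and then re-thresholds only an uncertain band around the coastline at full resolution. This is a simplified stand-in under assumed choices (Otsu for the coarse step, local mean thresholding for the fine step), not the authors' exact pipeline.

```python
import cv2
import numpy as np

def sea_land_segment(img, scale=8, band=16):
    """img: single-band (panchromatic) uint8 image. Returns 255=land, 0=sea,
    assuming land is the brighter class (an assumption of this sketch)."""
    # --- Coarse step: global Otsu threshold on a downsampled copy.
    small = cv2.resize(img, None, fx=1 / scale, fy=1 / scale,
                       interpolation=cv2.INTER_AREA)
    _, coarse = cv2.threshold(small, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    coarse = cv2.resize(coarse, img.shape[::-1], interpolation=cv2.INTER_NEAREST)

    # --- Locate the uncertain coastline band around the coarse boundary.
    edges = cv2.morphologyEx(coarse, cv2.MORPH_GRADIENT, np.ones((3, 3), np.uint8))
    band_mask = cv2.dilate(edges, np.ones((band, band), np.uint8)) > 0

    # --- Fine step: re-threshold only the coastline band using local
    # statistics at full resolution; keep the coarse labels elsewhere.
    fine = coarse.copy()
    local = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                  cv2.THRESH_BINARY, 31, 0)
    fine[band_mask] = local[band_mask]
    return fine

img = (np.random.rand(512, 512) * 255).astype(np.uint8)  # stand-in scene
mask = sea_land_segment(img)
```

Restricting the expensive full-resolution work to a narrow coastline band is what keeps a hierarchical scheme like this computationally efficient on large satellite scenes.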
Masked image modeling (MIM) has proven to be an effective pretext task for self-supervised pretraining (SSP): it helps the model capture a task-agnostic representation during pretraining and then improves fine-tuning performance on various downstream tasks. However, under the high random masking ratio of MIM, scene-level MIM-based SSP struggles to capture small-scale objects and local details in complex remote sensing scenes. Consequently, when pretrained models that mainly capture scene-level information are applied directly to object-level fine-tuning, there is an obvious representation learning misalignment between the pretraining and fine-tuning steps. Therefore, in this article, a novel object-centric masked image modeling (OCMIM) strategy is proposed to make the model better capture object-level information during pretraining and thereby advance the object detection fine-tuning step. First, to better learn an object-level representation covering all scales and multiple categories in MIM-based SSP, a novel object-centric data generator is proposed to automatically set up targeted pretraining data according to the objects themselves, providing a tailored data condition for object detection model pretraining. Second, an attention-guided mask generator is designed to produce a guided mask for the MIM pretext task, leading the model to learn a more discriminative representation of highly attended object regions than random masking does. Finally, experiments are conducted on six remote sensing object detection benchmarks, and the results show that the proposed OCMIM-based SSP strategy is a better pretraining approach for remote sensing object detection than commonly used methods.
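The snippet below sketches the attention-guided masking idea in PyTorch: instead of masking patches uniformly at random, the most highly attended patches are masked preferentially, so the model must reconstruct likely object regions. The per-patch attention scores are assumed to come from the model's own attention (e.g., CLS-token attention), and the simple top-k selection is an assumed simplification of the paper's mask generator.

```python
import torch

def attention_guided_mask(attn_scores, mask_ratio=0.6):
    """attn_scores: (B, N) per-patch attention scores.
    Returns a boolean mask of shape (B, N), True = patch is masked.
    Highly attended (object) patches are masked preferentially."""
    B, N = attn_scores.shape
    n_mask = int(N * mask_ratio)
    # Rank patches by attention; take the top-n_mask as masked positions.
    idx = attn_scores.argsort(dim=1, descending=True)[:, :n_mask]
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

attn = torch.rand(4, 196)                # e.g., a 14x14 patch grid
mask = attention_guided_mask(attn, 0.6)  # (4, 196), 117 True per row
```

In a full OCMIM-style pipeline this mask would replace the uniform random mask fed to the MIM reconstruction objective, concentrating the learning signal on object regions rather than background.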