Remote sensing images (RSIs) are characterized by complex spatial layouts and ground object structures. ViT can be a good choice for scene classification owing to its ability to capture long-range interactions between patches of the input image. However, because it lacks some of the inductive biases inherent to CNNs, such as locality and translation equivariance, ViT does not generalize well when trained on insufficient amounts of data. Compared with training a ViT from scratch, transferring a large-scale pre-trained one is more cost-efficient and achieves better performance even when the target dataset is small. In addition, the cross-entropy (CE) loss frequently used in scene classification has low robustness to noisy labels and generalizes poorly across different scenes. In this paper, a ViT-based model combined with supervised contrastive learning (CL), named ViT-CL, is proposed. For CL, the supervised contrastive (SupCon) loss, developed by extending the self-supervised contrastive approach to the fully-supervised setting, can exploit the label information of RSIs in the embedding space and improve robustness to common image corruptions. In ViT-CL, a joint loss function that combines the CE loss and the SupCon loss is developed to prompt the model to learn more discriminative features. A two-stage optimization framework is also introduced to enhance the controllability of the optimization process of ViT-CL. Extensive experiments on the AID, NWPU-RESISC45, and UCM datasets verified the superior performance of ViT-CL, which achieved the highest accuracies of 97.42%, 94.54%, and 99.76%, respectively, among all competing methods.
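To make the joint objective concrete, the following is a minimal sketch (not the authors' released code) of a CE-plus-SupCon loss in PyTorch, using one view per sample. The weighting factor `lam` and temperature `tau` are illustrative hyperparameters and are assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive (SupCon) loss over L2-normalized embeddings."""
    z = F.normalize(embeddings, dim=1)                  # (N, D) unit-norm features
    sim = z @ z.t() / tau                                # pairwise cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))      # exclude self-comparisons
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(1).clamp(min=1)             # avoid division by zero
    loss = -(log_prob * pos_mask).sum(1) / pos_count
    return loss[pos_mask.any(1)].mean()                  # average over anchors with positives

def joint_loss(logits, embeddings, labels, lam=0.5, tau=0.1):
    """Joint objective: CE on classifier logits plus a weighted SupCon term."""
    return F.cross_entropy(logits, labels) + lam * supcon_loss(embeddings, labels, tau)
```

In a two-stage setting such as the one described above, one plausible use is to optimize this joint loss first and then fine-tune with the CE term alone, though the exact staging is specified by the paper, not by this sketch.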
Recently, deep learning has achieved considerable results in hyperspectral image (HSI) classification. However, most available deep networks require ample, authentic samples to train well, which is expensive and inefficient in practical tasks. Existing few-shot learning (FSL) methods generally ignore the potential relationships between non-local spatial samples that would better represent the underlying features of HSI. To address these issues, a novel deep transformer and few-shot learning (DT-FSL) classification framework is proposed, aiming to achieve fine-grained classification of HSI from only a few labeled instances. Specifically, spatial attention and spectral query modules are introduced to overcome the locality constraint of the convolution kernel and to exploit the information between long-distance (non-local) samples, thereby reducing class uncertainty. Next, the network is trained with episodes and a task-based learning strategy to learn a metric space, which continuously enhances its modelling capability. Furthermore, the developed approach incorporates domain adaptation to reduce the variation in inter-domain distribution and achieve distribution alignment. Extensive experiments on three publicly available HSI datasets indicate that the proposed DT-FSL outperforms state-of-the-art algorithms.
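As an illustration of the episodic, metric-space training strategy described above, the following is a minimal prototype-style episode sketch. The encoder, the class-prototype formulation, and all names here are assumptions for illustration; the paper's transformer modules and domain-alignment term are not reproduced.

```python
import torch
import torch.nn.functional as F

def episode_loss(encoder, support_x, support_y, query_x, query_y, n_way):
    """One few-shot episode: classify query embeddings by distance to class prototypes."""
    z_s = encoder(support_x)                                   # (n_way * k_shot, D) support embeddings
    z_q = encoder(query_x)                                     # (n_query, D) query embeddings
    # Class prototypes: mean support embedding of each class in the episode.
    protos = torch.stack([z_s[support_y == c].mean(0) for c in range(n_way)])
    logits = -torch.cdist(z_q, protos)                         # negative Euclidean distance as score
    return F.cross_entropy(logits, query_y)
```

Repeating such episodes over randomly sampled tasks (and, in DT-FSL, adding a domain-adaptation term) is what shapes the metric space used at test time.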