2022
DOI: 10.3390/rs14225817

On the Co-Selection of Vision Transformer Features and Images for Very High-Resolution Image Scene Classification

Abstract: Recent developments in remote sensing technology have allowed us to observe the Earth with very high-resolution (VHR) images. VHR imagery scene classification is a challenging problem in the field of remote sensing. Vision transformer (ViT) models have achieved breakthrough results in image recognition tasks. However, transformer-encoder layers encode different levels of features, where the latest layer represents semantic information, in contrast to the earliest layers, which contain more detailed data but ig…
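The abstract's premise, that each transformer-encoder layer encodes a different level of information, is easy to see in code. Below is a minimal sketch (not the authors' implementation) that collects a [CLS] descriptor from every encoder block of a pretrained ViT; torchvision's ViT-B/16 and the 224×224 input are illustrative assumptions, since the paper's exact backbone is not specified here.

```python
# Hedged sketch: harvest one [CLS] descriptor per ViT encoder layer,
# giving a bank of candidate features from shallow (detailed) to deep (semantic).
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT).eval()

features = []  # one (B, 768) [CLS] descriptor per encoder layer

def hook(_module, _inputs, output):
    features.append(output[:, 0])  # token 0 is the [CLS] token

for layer in model.encoder.layers:          # 12 encoder blocks in ViT-B/16
    layer.register_forward_hook(hook)

with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))  # stand-in for a VHR scene patch

print(len(features), features[0].shape)     # 12 layers, each torch.Size([1, 768])
```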

Cited by 10 publications (4 citation statements)
References 46 publications
“…However, this approach results in a substantial increase in the model's trainable parameters, potentially compromising its practical deployment due to increased computational demands and memory requirements. To address this challenge, LFAGCU introduces the GFL module, which aims to model long-range non-local dependencies by leveraging an effective receptive field spanning the dimensions H × W. Notably, while ViT has demonstrated remarkable effectiveness in diverse computer vision tasks [35,36], it is limited by its weak spatial inductive bias and its dependence on fine-tuning, hindering its full potential for certain tasks [37]. To overcome the limitations of the weighted averaging of each pixel within the receptive field during convolution operations, which can allow noise pixels to degrade the distinguishability of image target pixels, the GFL module uses a multi-head self-attention mechanism for comprehensive global context modeling.…”
Section: Global Context Modeling
confidence: 99%
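As a concrete illustration of the mechanism this statement attributes to the GFL module, the sketch below (a hedged toy example, not LFAGCU's code) flattens an H × W feature map into a token sequence and applies multi-head self-attention, so every spatial position attends to every other one rather than to a local convolutional window; the channel count and head count are arbitrary assumptions.

```python
# Hedged sketch: global context over H*W positions via multi-head self-attention,
# in contrast to a convolution's local weighted averaging.
import torch
import torch.nn as nn

B, C, H, W = 2, 64, 16, 16
feat = torch.randn(B, C, H, W)                  # e.g., a CNN feature map

tokens = feat.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
ctx, _ = attn(tokens, tokens, tokens)           # effective receptive field = H x W

out = ctx.transpose(1, 2).reshape(B, C, H, W)   # back to a spatial map
print(out.shape)                                # torch.Size([2, 64, 16, 16])
```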
“…Object-oriented and visual attention-based methods have shown potential but are limited by manual feature extraction and model robustness issues. Here, we propose a novel approach that incorporates attention mechanism fusion and robotic multimodal information fusion decision-making in the framework of graph neural algorithms to address these challenges (Chaib et al., 2022; Chen et al., 2022; Tian et al., 2023).…”
Section: Related Work
confidence: 99%
“…Despite the progress made, the existing pixel-based methods can only reflect spectral information at an individual pixel level and lack a comprehensive understanding of the overall remote sensing image, leading to difficulties in obtaining meaningful … Here, we propose a novel approach that incorporates attention mechanism fusion and robotic multimodal information fusion decision-making in the framework of graph neural algorithms to address these challenges (Chaib et al., 2022; Chen et al., 2022; Tian et al., 2023).…”
Section: Related Work
confidence: 99%
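To make "attention mechanism fusion in the framework of graph neural algorithms" concrete, here is a minimal single-head graph-attention layer in the GAT style; it is only an assumed illustration of the general technique, not the cited authors' architecture.

```python
# Hedged sketch: GAT-style attention that fuses neighbor features per node.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.att = nn.Linear(2 * out_dim, 1, bias=False)   # scores edge (i, j)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = self.proj(x)                                   # (N, out_dim)
        n = h.size(0)
        # score every node pair, then mask to the graph's edges
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.att(pairs).squeeze(-1))      # (N, N) edge scores
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)                   # attention over neighbors
        return alpha @ h                                   # fused node features

# toy usage: 4 nodes (e.g., image regions), fully connected with self-loops
x = torch.randn(4, 8)
adj = torch.ones(4, 4)
print(GraphAttention(8, 16)(x, adj).shape)                 # torch.Size([4, 16])
```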
“…They used random forests and support vector machines (SVM), and their combined strengths were applied separately to Landsat-8, Sentinel-2, and Planet images to assess the individual and overall class accuracy of the images. Chaib et al. [12] proposed a new deep framework for very high-resolution (VHR) scene understanding that exploits the strengths of vision transformer (ViT) features in a simple and effective way. A pretrained ViT model is used to extract informative features from the original VHR image scene, where the transformer-encoder layers are used to generate the feature descriptors of the input images.…”
Section: Introduction
confidence: 99%
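The pipeline described above, ViT-derived descriptors fed to a conventional classifier, can be sketched as follows; the random features stand in for real ViT descriptors, and the SVM hyperparameters are assumptions rather than the paper's settings.

```python
# Hedged sketch: classify per-scene ViT descriptors with an SVM.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))    # stand-in: 200 scenes x 768-d ViT features
y = rng.integers(0, 5, size=200)   # stand-in: 5 scene classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")   # ~chance on random data
```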