2022
DOI: 10.1109/tgrs.2022.3157671
SCViT: A Spatial-Channel Feature Preserving Vision Transformer for Remote Sensing Image Scene Classification

Cited by 95 publications (46 citation statements)
References 46 publications
“…Then, Bashmal et al. [44] proposed the Data-efficient image transformer (DeiT), a ViT-based model trained by knowledge distillation with less data, and showed that ViT outperformed CNN-based methods on the remote sensing datasets AID and NWPU-RESISC. In [45], SCViT is proposed to overcome the limitation that the original model can capture only global spatial features.…”
Section: Related Work
confidence: 99%
“…LGCNet [33] is the first CNN-based remote sensing image SR model that utilizes local and global representations to learn image residuals between HR images and upscaled LR images. SCViT [34] proposes a spatial channel feature preservation model that considers the detailed geometric information of the high-spatial-resolution imagery. TransENet [35] employs a multiscale transformer to aggregate multidimensional spatial features while focusing on image spatial self-similarity.…”
Section: B. Lightweight SR
confidence: 99%
“…In this subsection, we introduce the Swin Transformer architecture, which is the backbone of the Meta-TR network. The transformer was originally used in NLP (natural language processing) [42], and in recent years it has also shown its superiority in remote sensing image processing [43]-[45]. However, the original transformer must attend to all pixels of the image in its self-attention computation, which sharply increases the computational cost and restricts deployment and application.…”
Section: B. Swin Transformer Architecture
confidence: 99%
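The cost argument in the excerpt above can be made concrete with a small sketch. This is an illustration, not code from the cited papers: the feature-map size, channel count, and window size below are assumed example values (a 56x56 token grid with a 7x7 window, typical of a Swin-style first stage).

```python
# Illustrative comparison of global vs. windowed self-attention cost.
# The shapes here are hypothetical example values, not from the cited papers.

def global_attention_cost(h: int, w: int, dim: int) -> int:
    """Global self-attention: every token attends to every other token,
    so computing the attention scores is quadratic in the token count."""
    n = h * w
    return n * n * dim  # multiply-adds for the N x N score matrix

def window_attention_cost(h: int, w: int, dim: int, window: int = 7) -> int:
    """Windowed self-attention (Swin-style): tokens attend only within
    local windows, so the cost is linear in the token count for a
    fixed window size."""
    n = h * w
    return n * (window * window) * dim

# A 56x56 token grid with 96 channels:
g = global_attention_cost(56, 56, 96)
wa = window_attention_cost(56, 56, 96, window=7)
print(g // wa)  # global attention is 64x more expensive in this setting
```

With a 56x56 grid and 7x7 windows the ratio is (56*56)/(7*7) = 64, which is why restricting attention to local windows eases the deployment restrictions the excerpt mentions.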