2021
DOI: 10.48550/arxiv.2103.07976
Preprint

TransFG: A Transformer Architecture for Fine-grained Recognition

Abstract: Fine-grained visual classification (FGVC), which aims at recognizing objects from subcategories, is a very challenging task due to the inherently subtle inter-class differences. Recent works mainly tackle this problem by focusing on how to locate the most discriminative image regions and relying on them to improve the capability of networks to capture subtle variances. Most of these works achieve this by reusing the backbone network to extract features of selected regions. However, this strategy inevitably complica…

Cited by 31 publications (73 citation statements)
References 39 publications
“…Nonetheless, few studies explore the vision transformer on FGVC. TransFG [12] is the first study to extend the ViT into FGVC on large-scale FGVC datasets. However, we argue that TransFG cannot capture enough discriminative information on some challenging datasets, i.e., small-scale and ultra-fine-grained datasets.…”
Section: Transformer
Citation type: mentioning
confidence: 99%
“…3.2 FFVT Architecture [12] suggests that the ViT cannot capture enough local information required for FGVC. To cope with this problem, we propose to fuse the low-level features and middle-level features to enrich the local information.…”
Section: ViT for Image Recognition
Citation type: mentioning
confidence: 99%
“…DeiT [42] strengthens ViT by introducing a powerful training recipe and adopting knowledge distillation. Built upon the success of ViT, many efforts have been devoted to improving ViT and adapting it into various vision tasks including image classification [42,43,15,53,26,32,20], object localization/detection [18,51,32,17] and image segmentation [51,32,40,7].…”
Section: Related Work
Citation type: mentioning
confidence: 99%