Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition

Liu, Xinda; Wang, Lili; Han, Xiaoguang

doi:10.48550/arxiv.2107.06538

Cited by 3 publications

(7 citation statements)

References 45 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…For example, TransFG [16] proposes the first ViT-based fine-grained classification method, in which a part selection module is designed to select discriminative tokens. Following that, several methods [19,29,39] also apply ViT to FGVC. Thereinto, TPSKG [29] proposes a peak suppression module which penalizes the attention to the most discriminative part and a knowledge guidance module to obtain the knowledge response coefficients.…”

Section: Related Workmentioning

confidence: 99%

“…Following that, several methods [19,29,39] also apply ViT to FGVC. Thereinto, TPSKG [29] proposes a peak suppression module which penalizes the attention to the most discriminative part and a knowledge guidance module to obtain the knowledge response coefficients. RAMS-Trans [19] learns discriminative region attention in a multi-scale way.…”

Section: Related Workmentioning

confidence: 99%

“…ViT [7] ViT-B_16 89.8 RAMS-Trans [19] ViT-B_16 91.3 TPSKG [29] ViT-B_16 91.3 FFVT [39] ViT-B_16 91.6 TransFG [16] ViT-B_16 91.7…”

Section: Training and Inferencementioning

confidence: 99%

“…Inspired by this, Vision Transformer (ViT) [7] with multiple self-attention layers has also been introduced into Computer Vision and attracted extensive attention. More recently, several works [16,19,29,39] have tried to apply ViT to FGVC and make a breakthrough. These primary ViT-based attempts have significantly surpassed existing convolution-based methods, demonstrating the superiority of ViT in FGVC.…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

Zhang¹,

Chen²,

Wang³

et al. 2022

Preprint

View full text Add to dashboard Cite

Recently, several Vision Transformer (ViT) based methods have been proposed for Fine-Grained Visual Classification (FGVC). These methods significantly surpass existing CNN-based ones, demonstrating the effectiveness of ViT in FGVC tasks. However, there are some limitations when applying ViT directly to FGVC. First, ViT needs to split images into patches and calculate the attention of every pair, which may result in heavy redundant calculation and unsatisfying performance when handling fine-grained images with complex background and small objects. Second, a standard ViT only utilizes the class token in the final layer for classification, which is not enough to extract comprehensive fine-grained information.To address these issues, we propose a novel ViT based fine-grained object discriminator for FGVC tasks, ViT-FOD for short. Specifically, besides a ViT backbone, it further introduces three novel components, i.e, Attention Patch Combination (APC), Critical Regions Filter (CRF), and Complementary Tokens Integration (CTI). Thereinto, APC pieces informative patches from two images to generate a new image so that the redundant calculation can be reduced. CRF emphasizes tokens corresponding to discriminative regions to generate a new class token for subtle feature learning. To extract comprehensive information, CTI integrates complementary information captured by class tokens in different ViT layers. We conduct comprehensive experiments on widely used datasets and the results demonstrate that ViT-FOD is able to achieve state-of-the-art performance.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

“…ViT [7] ViT-B_16 89.8 RAMS-Trans [19] ViT-B_16 91.3 TPSKG [29] ViT-B_16 91.3 FFVT [39] ViT-B_16 91.6 TransFG [16] ViT-B_16 91.7…”

Section: Training and Inferencementioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

Zhang¹,

Chen²,

Wang³

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…Other researches focus on extracting more useful features from multi-channel networks [5,57] or contrastive learning [1,13]. TransFG [17] and TPSKG [27] have recently used the Transformer architecture to improve classification performance. The fine-grained methods generally suffer from a complex pipeline and enormous manual design.…”

Section: Related Workmentioning

confidence: 99%

Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information

Yang¹,

Li²,

Song³

et al. 2022

Preprint

View full text Add to dashboard Cite

A survey: researches advanced in weakly supervised fine-grained visual classification

Ran

2022

2nd International Conference on Artificial Intelligence, Automation, and High-Performance Computing (AIAHPC 2022)

View full text Add to dashboard Cite

With the development of multimedia technology, Fine-Grained Visual Classification (FGVC) has gradually become one of the new hot tasks in computer vision community, whose goal is to identify images that belong to the same species. Though the accuracy of FGVC has made a great breakthrough, the performance is still limited by the issue of locating objects’ discriminative regions, as common state-of-the-art convolutional neural networks that perform excellently in image classification task such as ImageNet-1k cannot be directly applied to FGVC tasks. In this case, we provide a comprehensive and systematic survey of recent advances in FGVC field and divide the existing methods into: creative application of attentive structures, aids of diverse pretraining methods, various designs of loss functions and other innovative methods. We further analyze the performance of representative methods on common data sets, and finally summarize the existing problems in the FGVC research field and predict the solutions to these problems in the future.

show abstract

Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition

Cited by 3 publications

References 45 publications

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information

A survey: researches advanced in weakly supervised fine-grained visual classification

Contact Info

Product

Resources

About