TransFG: A Transformer Architecture for Fine-grained Recognition

He, Jifeng; Chen, Jie-Neng; Liu, Shuai; Kortylewski, Adam; Yang, Cheng; Bai, Yang; Wang, Changhu

doi:10.48550/arxiv.2103.07976

Cited by 31 publications

(73 citation statements)

References 39 publications

“…Nonetheless, few studies explore the vision transformer on FGVC. TransFG [12] is the first study to extend the ViT into FGVC on large-scale FGVC datasets. However, we argue that TransFG cannot capture enough discriminative information on some challenging datasets, i.e., small-scale and ultra-fine-grained datasets.…”

Section: Transformer (mentioning)

confidence: 99%

“…3.2 FFVT Architecture [12] suggests that the ViT cannot capture enough local information required for FGVC. To cope with this problem, we propose to fuse the low-level features and middle-level features to enrich the local information.…”

Section: ViT for Image Recognition (mentioning)

confidence: 99%

“…After that, most methods often adopt a rank loss [2] on the classification outputs for all local features. However, [12] argues that RPN-based methods ignore the relationships among selected regions. Another problem is that this mechanism drives the RPN to propose large bounding boxes as they are more likely to contain the foreground objects.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, few study investigate the performance of vision transformer in FGVC. As the first work to study the vision transformer on FGVC, [12] proposed to replace the inputs of the final transformer layer with some important tokens and achieved state-of-the-art performance on some benchmarks. Nonetheless, the final class token may concern more on global information and pay less attention to local and low-level features, defecting the performance of vision transformer on FGVC since local information plays an important role in FGVC.…”

Section: Introduction (mentioning)

confidence: 99%

Feature Fusion Vision Transformer for Fine-Grained Visual Categorization

Wang, Gao. 2021. Preprint.

The core for tackling the fine-grained visual categorization (FGVC) is to learn subtle yet discriminative features. Most previous works achieve this by explicitly selecting the discriminative parts or integrating the attention mechanism via CNN-based approaches. However, these methods enhance the computational complexity and make the model dominated by the regions containing the most of the objects. Recently, vision transformer (ViT) has achieved SOTA performance on general image recognition tasks. The self-attention mechanism aggregates and weights the information from all patches to the classification token, making it perfectly suitable for FGVC. Nonetheless, the classification token in the deep layer pays more attention to the global information, lacking the local and low-level features that are essential for FGVC. In this work, we propose a novel pure transformer-based framework Feature Fusion Vision Transformer (FFVT) where we aggregate the important tokens from each transformer layer to compensate the local, low-level and middle-level information. We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens without introducing extra parameters. We verify the effectiveness of FFVT on three benchmarks where FFVT achieves the state-of-the-art performance.

“…DeiT [42] strengthens ViT by introducing a powerful training recipe and adopting knowledge distillation. Built upon the success of ViT, many efforts have been devoted to improving ViT and adapting it into various vision tasks including image classification [42,43,15,53,26,32,20], object localization/detection [18,51,32,17] and image segmentation [51,32,40,7].…”

Section: Related Work (mentioning)

confidence: 99%

TransMix: Attend to Mix for Vision Transformers

Chen, Sun, He, et al. 2021. Preprint.

Self-citation

Mixup-based augmentation has been found to be effective for generalizing models during training, especially for Vision Transformers (ViTs) since they can easily overfit. However, previous mixup-based methods have an underlying prior knowledge that the linearly interpolated ratio of targets should be kept the same as the ratio proposed in input interpolation. This may lead to a strange phenomenon that sometimes there is no valid object in the mixed image due to the random process in augmentation but there is still response in the label space. To bridge such gap between the input and label spaces, we propose TransMix, which mixes labels based on the attention maps of Vision Transformers. The confidence of the label will be larger if the corresponding input image is weighted higher by the attention map. TransMix is embarrassingly simple and can be implemented in just a few lines of code without introducing any extra parameters and FLOPs to ViT-based models. Experimental results show that our method can consistently improve various ViT-based models at scales on ImageNet classification. After pre-trained with TransMix on ImageNet, the ViT-based models also demonstrate better transferability to semantic segmentation, object detection and instance segmentation. TransMix also exhibits to be more robust when evaluating on 4 different benchmarks. Code will be made publicly available at https://github.com/Beckschen/TransMix.

Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

Horn, Qian, Wilber, et al. 2022. Lecture Notes in Computer Science.

We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. While our community has made great strides in fine-grained visual categorization on images, the counterparts in audio and video fine-grained categorization are relatively unexplored. To encourage advancements in this space, we have carefully constructed the SSW60 dataset to enable researchers to experiment with classifying the same set of categories in three different modalities: images, audio, and video. The dataset covers 60 species of birds and is comprised of images from existing datasets, and brand new, expert curated audio and video datasets. We thoroughly benchmark audiovisual classification performance and modality fusion experiments through the use of state-of-the-art transformer methods. Our findings show that performance of audiovisual fusion methods is better than using exclusively image or audio based methods for the task of video classification. We also present interesting modality transfer experiments, enabled by the unique construction of SSW60 to encompass three different modalities. We hope the SSW60 dataset and accompanying baselines spur research in this fascinating area.
