2024
DOI: 10.1109/tmm.2023.3238548

TransIFC: Invariant Cues-aware Feature Concentration Learning for Efficient Fine-grained Bird Image Classification

Cited by 48 publications (12 citation statements)
References 0 publications
“…ViT [42], ViT-B_16, 448 × 448, 90.8; TransIFC [65], ViT-B_16, 448 × 448, 91.0; TransFG [44], ViT-B_16, 448 × 448, 91.1; TPSKG [46], ViT-B_16, 448 × 448, 91.3; RAMS-Trans [28], ViT-B_16, 448 × 448, 91.3; FFVT [45], ViT-B_16, 448 × 448, 91.4; DCAL [27], ViT-B_16, 448 × 448, 91.4; SIM-Trans [25], ViT-B_16, 448 × 448, 91.5; AFTrans [26], ViT-B_16, 448 × 448, 91.5; IELT [24], ViT […] The state-of-the-art methods at this stage are organized in Table 8. We can see that our model obtains 1.7% and 1.0% improvements compared to the state-of-the-art CNN-based model PRIS [66] and ViT [42], respectively, which are higher than the results for CUB-200-2011, indicating that our method does not fail due to the increase in the amount of data.…”
Section: Ablation Experiments and Analysis
Type: mentioning
confidence: 99%
“…For example, even when a sentence contains a substantial amount of colloquial information (Figure 1a), we can still capture the key meaning of the sentence, such as the words "team" and "scoring". Grabbing special features in the image can achieve a major effect breakthrough, such as in [13][14][15][16]. Specifically, accurate prediction can be achieved despite the serious interference of colloquial information by utilizing the semantic relationships of the remaining key words.…”
Section: Observations and Insights
Type: mentioning
confidence: 99%
“…CNN remains one of the most important and effective models in computer vision. The multiscale feature fusion pyramid in this study is also based on the backbone network of CNN to introduce residual connectivity between levels, thus separating different levels of features from shallow to deep, mining the channel and pixel relationships of different scale features, and performing feature fusion to obtain multiscale fusion features [31].…”
Section: Related Work
Type: mentioning
confidence: 99%