Fine-grained visual classification (FGVC) is a challenging task because it requires learning subtle discriminative feature representations. Attention-based methods show great potential for FGVC, but they neglect that deeply mining inter-layer feature relations can refine feature learning. Similarly, methods that associate cross-layer features achieve significant feature enhancement, but they lose the long-distance dependencies between elements. Most previous research treats these two approaches as independent of each other, overlooking that they are mutually correlated and can jointly reinforce feature learning. We therefore combine the respective advantages of the two approaches to promote fine-grained feature representations. In this paper, we propose a novel network, CLNET, which effectively applies the attention mechanism and cross-layer features to obtain feature representations. Specifically, CLNET 1) adopts self-attention to capture long-range dependencies for each element, 2) associates cross-layer features to reinforce feature learning, and 3) integrates attention-based operations between output and input to cover more feature regions. Experiments verify that CLNET yields new state-of-the-art performance on three widely used fine-grained benchmark datasets: CUB-200-2011, Stanford Cars, and FGVC-Aircraft. Our code is available at https://github.com/dlearing/CLNET.git.

INDEX TERMS: Associating cross-layer features, attention-based operations, self-attention, CLNET.
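To make the abstract's first two components concrete, below is a minimal PyTorch sketch of how self-attention over spatial positions can capture long-range dependencies, and how a shallow and a deep feature map can be associated across layers. This is not the authors' released code (see the repository above); module names, the channel-reduction factor, and the fusion design are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSelfAttention(nn.Module):
    """Self-attention over the H*W spatial elements of a CNN feature map,
    capturing long-range pairwise dependencies between all positions."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c//8)
        k = self.key(x).flatten(2)                     # (b, c//8, hw)
        attn = F.softmax(q @ k, dim=-1)                # (b, hw, hw) position-pair weights
        v = self.value(x).flatten(2)                   # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # attention-refined features

class CrossLayerFusion(nn.Module):
    """Associates a shallow and a deep feature map so inter-layer relations
    reinforce feature learning (a generic fusion, not CLNET's exact design)."""
    def __init__(self, shallow_c, deep_c):
        super().__init__()
        self.align = nn.Conv2d(shallow_c, deep_c, 1)   # match channel widths

    def forward(self, shallow, deep):
        shallow = self.align(shallow)
        shallow = F.adaptive_avg_pool2d(shallow, deep.shape[-2:])  # match spatial size
        return deep + shallow                          # element-wise association
```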
Learning subtle discriminative feature representations plays a significant role in fine-grained visual categorisation (FGVC). The vision transformer (ViT) achieves promising performance in traditional image classification thanks to its multi-head self-attention mechanism. Unfortunately, ViT cannot effectively capture critical feature regions for FGVC because it focuses only on the classification token and adopts a one-time image-input strategy. Besides, ViT does not exploit the advantage of fusing attention weights. To improve the capture of vital regions for FGVC, the authors propose a novel model named RDTrans, which proposes the discriminative region with top priority in a recurrent learning way. Specifically, the vital region proposed at each scale is cropped and amplified to serve as the input at the next scale, finally locating the most discriminative region. Furthermore, a distillation learning method is employed to provide better supervision and elevate generalisation ability. Concurrently, RDTrans can be easily trained end-to-end in a weakly supervised way. Extensive experiments demonstrate that RDTrans yields state-of-the-art performance on four widely used fine-grained benchmarks, including CUB-200-2011, Stanford Cars, Stanford Dogs, and iNat2017.
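As an illustration of the recurrent crop-and-amplify strategy the abstract describes, here is a minimal sketch assuming a ViT backbone that exposes per-patch CLS-token attention weights. The model interface, the thresholding rule, and all names are hypothetical, not the RDTrans API.

```python
import torch
import torch.nn.functional as F

def locate_discriminative_region(attn, grid, keep=0.5):
    """Given per-patch attention weights (e.g. CLS-token attention averaged
    over heads, shape (grid*grid,)), return a bounding box around the
    highest-weight patches; `keep` is the fraction of patches retained."""
    heat = attn.view(grid, grid)
    k = max(1, int(grid * grid * (1 - keep)))
    thresh = heat.flatten().kthvalue(k).values
    ys, xs = torch.nonzero(heat >= thresh, as_tuple=True)
    return ys.min(), ys.max() + 1, xs.min(), xs.max() + 1  # patch coordinates

def recurrent_region_refinement(model, image, steps=3, grid=14, size=224):
    """Recurrently crop and amplify the proposed vital region, feeding the
    zoomed crop back into the model at the next scale."""
    logits_per_scale = []
    for _ in range(steps):
        # Assumption: the backbone returns (logits, per-patch CLS attention).
        logits, cls_attn = model(image)
        logits_per_scale.append(logits)
        y0, y1, x0, x1 = locate_discriminative_region(cls_attn, grid)
        patch = size // grid
        crop = image[..., y0 * patch:y1 * patch, x0 * patch:x1 * patch]
        image = F.interpolate(crop, size=(size, size),
                              mode='bilinear', align_corners=False)  # amplify
    return logits_per_scale  # per-scale predictions, e.g. averaged at inference
```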