Fine-Grained Visual Categorization (FGVC) involves learning detailed features for images that belong to subcategories of the same superclass and are therefore difficult to distinguish. This enables networks to differentiate between instances of different classes that share very similar visual content. Learning to extract nuanced representations of selected object details is therefore crucial. This paper introduces a novel fine-grained visual classification model with the Vision Transformer (ViT) as the backbone, namely the Multistage Attention Region Supplement Transformer (MARS-Trans). Our main contributions are as follows. First, we observed that in ViT's multi-head attention module, the softmax-normalized outputs of all attention heads are directly concatenated and multiplied by a shared projection weight matrix. Consequently, we propose a Multistage Attention Module (MAM) that grades the attention heads according to their weights. Additionally, we introduce a Region Supplement Module (RSM) that suppresses non-critical regions and enhances edge information in key areas, further emphasizing the discriminative features. Finally, we apply our proposed Approximate Adjust Method (AAM) to refine the final features and improve classification results. We conducted thorough experiments with MARS-Trans on four popular public fine-grained image datasets, validating the effectiveness of these modules. State-of-the-art results are achieved on one dataset, and competitive performance is demonstrated on the other three. The code is available at https://github.com/ArrikenMei/MARS-Trans.
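To make the observation behind MAM concrete, the sketch below shows the standard ViT multi-head attention step in which each head's softmaxed attention result is concatenated and multiplied by a single output projection. This is a minimal illustration, not the authors' implementation; in particular, `grade_heads`, which ranks heads by the norm of their slice of the projection matrix, is a hypothetical stand-in for MAM's weight-based grading.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Standard ViT-style multi-head self-attention.

    The per-head softmax(QK^T / sqrt(d_h)) V results are concatenated
    and multiplied by the shared output projection Wo -- the step the
    abstract's observation refers to.
    """
    n, d = x.shape
    dh = d // num_heads
    # Project and split into heads: (num_heads, n, dh)
    q = (x @ Wq).reshape(n, num_heads, dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(n, num_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(n, num_heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, n, n)
    heads = (attn @ v).transpose(1, 0, 2).reshape(n, d)      # concatenation
    return heads @ Wo                                        # shared projection


def grade_heads(Wo, num_heads):
    """Hypothetical grading (a stand-in for MAM): rank heads by the
    Frobenius norm of the rows of Wo that multiply each head's output."""
    dh = Wo.shape[0] // num_heads
    norms = [np.linalg.norm(Wo[h * dh:(h + 1) * dh]) for h in range(num_heads)]
    return np.argsort(norms)[::-1]  # most-weighted head first
```

Because the concatenated head outputs meet `Wo` row-block by row-block, each head's contribution is governed by its own slice of the projection, which is what makes a per-head grading like the one sketched above possible.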