In computer vision, fine-grained classification, the task of recognizing objects that differ only slightly in appearance, has become an important problem. Traditional convolutional neural networks (CNNs) often struggle to perform well on it. To improve both the accuracy and the training time of CNNs on fine-grained classification, this paper proposes a tree-structured framework that removes the effect of differences between clusters. The proposed method makes three contributions: (1) a self-supervised method that automatically creates a classification tree, eliminating the need for manual labeling; (2) a machine-learning matcher that determines the cluster to which an item belongs, minimizing the impact of inter-cluster variations on classification; and (3) a pruning criterion that filters the tree-structured classifier, retaining only the models with superior classification performance. Experimental evaluation shows that, compared with conventional CNN models, the proposed framework reduces training time and improves fine-grained classification accuracy across several datasets. Specifically, on the CUB-200-2011, FGVC-Aircraft, and Stanford Cars datasets, the proposed method reduces training time by 32.91%, 35.87%, and 14.48% and improves classification accuracy by 1.17%, 2.01%, and 0.59%, respectively.
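To make the three components concrete, the following is a minimal sketch of how a tree builder, a matcher, and a pruning rule could fit together. It is not the paper's implementation: the abstract does not specify the algorithms, so plain k-means over per-class mean features stands in for the self-supervised tree construction, a nearest-center rule stands in for the machine-learning matcher, and a "beats the flat baseline" threshold stands in for the pruning criterion. All function names (`class_centroids`, `build_tree`, `match_cluster`, `prune`) and the toy data are hypothetical.

```python
from collections import defaultdict
from math import dist


def class_centroids(features, labels):
    """Return the mean feature vector of each class label."""
    groups = defaultdict(list)
    for x, y in zip(features, labels):
        groups[y].append(x)
    return {y: tuple(sum(c) / len(xs) for c in zip(*xs))
            for y, xs in groups.items()}


def build_tree(centroids, k, iters=10):
    """Group classes into k clusters (assumed stand-in: plain k-means).

    Returns (cluster_centers, {class_label: cluster_index}).
    """
    classes = sorted(centroids)
    # Deterministic initialisation from the first k classes (an assumption).
    centers = [centroids[y] for y in classes[:k]]
    assign = {}
    for _ in range(iters):
        # Assign every class to its nearest cluster center.
        assign = {y: min(range(k), key=lambda j: dist(centroids[y], centers[j]))
                  for y in classes}
        # Move each center to the mean of the classes assigned to it.
        for j in range(k):
            members = [centroids[y] for y in classes if assign[y] == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assign


def match_cluster(x, centers):
    """Matcher (assumed stand-in): route a sample to the nearest center."""
    return min(range(len(centers)), key=lambda j: dist(x, centers[j]))


def prune(leaf_accuracy, baseline_accuracy):
    """Keep only the leaf models whose validation accuracy beats the baseline."""
    return {j for j, acc in leaf_accuracy.items() if acc > baseline_accuracy}


# Toy data: four classes, where 0/1 and 2/3 are visually similar pairs.
features = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 1, 2, 3]
centers, assign = build_tree(class_centroids(features, labels), k=2)
```

In this sketch, each cluster would get its own small classifier trained only on the classes assigned to it, so a sample is first routed by `match_cluster` and then classified among a handful of similar classes. Clusters rejected by `prune` would fall back to the flat baseline model.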