In recent years, the automatic recognition of tree species based on images taken by digital cameras has been widely applied. However, many problems still exist, such as insufficient tree species image acquisition, uneven distribution of image categories, and low recognition accuracy. Tree leaves can be used to differentiate and classify tree species due to their cognitive signatures in color, vein texture, shape contour, and edge serration. Moreover, the way the leaves are arranged on the twigs has strong characteristics. In this study, we first built an image dataset of 21 tree species based on the features of the twigs and leaves. The tree species feature dataset was divided into the training set and test set, with a ratio of 8:2. Feature extraction was performed after training the convolutional neural network (CNN) using the k-fold cross-validation (K-Fold–CV) method, and tree species classification was performed with classifiers. To improve the accuracy of tree species identification, we combined three improved CNN models with three classifiers. Evaluation indicators show that the overall accuracy of the designed composite model was 1.76% to 9.57% higher than other CNN models. Furthermore, in the MixNet XL CNN model, combined with the K-nearest neighbors (KNN) classifier, the highest overall accuracy rate was obtained at 99.86%. In the experiment, the Grad-CAM heatmap was used to analyze the distribution of feature regions that play a key role in classification decisions. Observation of the Grad-CAM heatmap illustrated that the main observation area of SE-ResNet50 was the most accurately positioned, and was mainly concentrated in the interior of small twigs and leaflets. Our research showed that modifying the training method and classification module of the CNN model and combining it with traditional classifiers to form a composite model can effectively improve the accuracy of tree species recognition.