Accurate mapping of tree species is critical for the sustainable development of the forestry industry. However, the lack of cloud-free optical images makes it challenging to map tree species accurately in cloudy mountainous regions. In order to improve tree species identification in this context, a classification method using spatiotemporal fusion and ensemble classifier is proposed. The applicability of three spatiotemporal fusion methods, i.e., the spatial and temporal adaptive reflectance fusion model (STARFM), the flexible spatiotemporal data fusion (FSDAF), and the spatial and temporal nonlocal filter-based fusion model (STNLFFM), in fusing MODIS and Landsat 8 images was investigated. The fusion results in Helong City show that the STNLFFM algorithm generated the best fused images. The correlation coefficients between the fusion images and actual Landsat images on May 28 and October 19 were 0.9746 and 0.9226, respectively, with an average of 0.9486. Dense Landsat-like time series at 8-day time intervals were generated using this method. This time series imagery and topography-derived features were used as predictor variables. Four machine learning methods, i.e., K-nearest neighbors (KNN), random forest (RF), artificial neural networks (ANNs), and light gradient boosting machine (LightGBM), were selected for tree species classification in Helong City, Jilin Province. An ensemble classifier combining these classifiers was constructed to further improve the accuracy. The ensemble classifier consistently achieved the highest accuracy in almost all classification scenarios, with a maximum overall accuracy improvement of approximately 3.4% compared to the best base classifier. Compared to only using a single temporal image, utilizing dense time series and the ensemble classifier can improve the classification accuracy by about 20%, and the overall accuracy reaches 84.32%. In conclusion, using spatiotemporal fusion and the ensemble classifier can significantly enhance tree species identification in cloudy mountainous areas with poor data availability.