Monitoring surface vegetation in the “Three-River Headwaters” Region (TRHR) relies on satellite data with high spatial and temporal resolutions. Spatiotemporal fusion of multiple data sources can effectively overcome the limitations that weather, the satellite return period, and funding impose on research data, yielding data with higher spatial and temporal resolutions. This paper explores the spatial and temporal adaptive reflectance fusion model (STARFM), the enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM), and the flexible spatiotemporal data fusion (FSDAF) method applied to Sentinel-2 and MODIS data in a typical area of the TRHR. The control variable method was used to analyze the parameter sensitivity of the models and to identify parameters suited to the Sentinel-2 and MODIS data in the study area. Because spatiotemporal fusion models have been applied directly to vegetation index products, this study used NDVI fusion as an example and set up a comparison experiment (experiment I first performed the band spatiotemporal fusion and then calculated the vegetation index; experiment II calculated the vegetation index first and then performed the spatiotemporal fusion) to explore the feasibility and applicability of the two methods for vegetation index fusion. The results showed the following. (1) All three spatiotemporal fusion models generated data with high spatial and temporal resolution from the fusion of Sentinel-2 and MODIS data; the STARFM and FSDAF models had higher fusion accuracy, with R² values after fusion above 0.8, showing greater applicability. (2) The fusion accuracy of each model was affected by the model parameters. 
The errors between the STARFM, ESTARFM, and FSDAF fusion results and the validation data all decreased as the size of the sliding window or the number of similar pixels increased, and stabilized once the sliding window became larger than 50 and the number of similar pixels became larger than 80. (3) The comparative experiment showed that the spatiotemporal fusion models can be applied directly to vegetation index products, and that higher-quality vegetation index data can be obtained by calculating the vegetation index first and then performing the spatiotemporal fusion. The high spatial and temporal resolution data obtained using a suitable spatiotemporal fusion model are important for identifying and monitoring surface cover types in the TRHR.
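
The difference between the two experimental orderings can be illustrated with a minimal Python sketch. It assumes the standard NDVI definition, (NIR - Red) / (NIR + Red), and uses a hypothetical `toy_fuse` function that simply adds the coarse-resolution temporal change to the fine-resolution base image. `toy_fuse` is a stand-in chosen for illustration only and is not STARFM, ESTARFM, or FSDAF; the reflectance values are invented and are not data from this study. The sketch shows only that the two orderings need not give the same result, because NDVI is a nonlinear function of the bands; it says nothing about which ordering is more accurate.

```python
import numpy as np


def ndvi(nir, red):
    """Standard NDVI: (NIR - Red) / (NIR + Red)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red)


def toy_fuse(fine_t1, coarse_t1, coarse_t2):
    """Hypothetical stand-in for a spatiotemporal fusion model.

    Adds the coarse-resolution change between t1 and t2 to the
    fine-resolution image at t1. This is NOT STARFM/ESTARFM/FSDAF.
    """
    return np.asarray(fine_t1, dtype=float) + (
        np.asarray(coarse_t2, dtype=float) - np.asarray(coarse_t1, dtype=float)
    )


# Invented single-pixel reflectances (assumptions, not study data).
fine_red_t1, fine_nir_t1 = 0.10, 0.40      # fine image (e.g. Sentinel-2) at t1
coarse_red_t1, coarse_nir_t1 = 0.12, 0.36  # coarse image (e.g. MODIS) at t1
coarse_red_t2, coarse_nir_t2 = 0.08, 0.44  # coarse image (e.g. MODIS) at t2

# Experiment I: fuse each band first, then compute NDVI.
red_pred = toy_fuse(fine_red_t1, coarse_red_t1, coarse_red_t2)
nir_pred = toy_fuse(fine_nir_t1, coarse_nir_t1, coarse_nir_t2)
ndvi_exp1 = ndvi(nir_pred, red_pred)

# Experiment II: compute NDVI first, then fuse the NDVI images.
ndvi_exp2 = toy_fuse(
    ndvi(fine_nir_t1, fine_red_t1),
    ndvi(coarse_nir_t1, coarse_red_t1),
    ndvi(coarse_nir_t2, coarse_red_t2),
)

print(f"Experiment I  (fuse bands, then NDVI): {float(ndvi_exp1):.4f}")
print(f"Experiment II (NDVI, then fuse):       {float(ndvi_exp2):.4f}")
```

With a real fusion model, the same two call orders would be evaluated against validation imagery, as in the comparison experiment described above.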