Precise regional crop yield estimates based on high-spatiotemporal-resolution remote sensing data are essential for directing agronomic practices and policies to increase food security. This study used the enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM), the flexible spatiotemporal data fusion (FSDAF) model, and the spatial and temporal non-local filter-based fusion model (STNLFFM) to calculate the normalized difference vegetation index (NDVI) of the summer maize planting area in the Southeast Loess Plateau from Sentinel-2 and MODIS data. The spatial and temporal resolutions were 10 m and 1 d, respectively. We then evaluated the adaptability of the ESTARFM, FSDAF, and STNLFFM fusion models in the field in terms of the spatial and textural characteristics of the data, the summer maize NDVI growth curves, and yield estimation accuracy, using qualitative visual discrimination and quantitative statistical analysis. The results showed that, compared with MODIS–NDVI, the fused ESTARFM–NDVI, FSDAF–NDVI, and STNLFFM–NDVI precisely represented the variation trend and local abrupt changes of NDVI during the growth period of summer maize. STNLFFM–NDVI correlated well with Sentinel-2–NDVI, with large correlation coefficients and a small root mean square error (RMSE). In the simulation of the summer maize NDVI growth curve, STNLFFM introduced overall weights based on non-local mean filtering, which significantly improved the poor fusion results at the seedling and maturity stages that ESTARFM produced because of long gaps in the high-resolution data. The accuracy of yield estimation, from high to low, was as follows: STNLFFM (R = 0.742, mean absolute percentage error (MAPE) = 6.22%), ESTARFM (R = 0.703, MAPE = 6.80%), and FSDAF (R = 0.644, MAPE = 10.52%). The FSDAF fusion model was affected by the spatial heterogeneity in the semi-humid areas, and its yield simulation accuracy was low.
In the semi-arid areas, the FSDAF fusion model had the advantages of requiring less input data and responding faster.
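
For readers who want to reproduce the quantities reported above, the following is a minimal sketch of how NDVI and the three accuracy statistics (R, RMSE, MAPE) are conventionally computed. It is not the authors' processing chain: the function names are hypothetical, the use of NumPy is an assumption, and the choice of red and near-infrared inputs (for Sentinel-2, typically bands B4 and B8) is assumed rather than stated in the text.

```python
import numpy as np


def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); pixels with a zero denominator become NaN."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    out = np.full_like(denom, np.nan)
    # Only divide where the denominator is non-zero; other pixels stay NaN.
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out


def pearson_r(observed, predicted):
    """Pearson correlation coefficient R between two 1-D series."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.corrcoef(observed, predicted)[0, 1])


def rmse(observed, predicted):
    """Root mean square error, in the units of the inputs."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be non-zero)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((observed - predicted) / observed)) * 100.0)
```

Under these definitions, a fused NDVI image would be compared pixel by pixel with the Sentinel-2 NDVI of the same date using `pearson_r` and `rmse`, while `mape` would compare estimated and measured yields.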