Estimating the net primary production (NPP) of vegetation is essential for eco-environment conservation and carbon cycle research. Remote sensing techniques, combined with algorithm models, have proven promising for NPP estimation. High-precision, real-time NPP monitoring in heterogeneous areas requires remote sensing data with high spatio-temporal resolution, which are difficult to acquire from a single sensor, especially in cloudy weather. This study proposes fusing images from different sensors to provide high spatio-temporal resolution data for NPP estimation in cloud-prone areas. First, a Normalized Difference Vegetation Index (NDVI) time series with a spatial resolution of 30 m and a temporal resolution of 16 days is generated with the enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM). The NDVI time series and meteorological data are then input into an improved Carnegie–Ames–Stanford Approach (CASA) model to estimate NPP. The method is validated in a case study of a heavily urbanized area in the middle reaches of the Yangtze River in China. The results indicate that NPP estimated from the fused NDVI data contains more detailed spatial information than NPP estimated from MODIS data. The fused NDVI images correlate strongly with the actual Landsat 8 NDVI images, which means that the synthetic NDVI images (16-day interval, 30 m resolution) are reliable and can provide superior inputs for accurate estimation of an NPP time series. The correlation coefficient (R) and root mean square error between the NPP based on the fused NDVI and the measured NPP are 0.66 and 14.280 g C/(m²·yr), respectively, indicating good consistency. The small discrepancy is attributed to uncertainties in the fused NDVI, measurement errors, conversion errors, and other factors in the CASA model.
In this study, we obtained NPP at high spatial and temporal resolutions, which provides more accurate NPP data for analyzing carbon cycling in heavily urbanized areas than similar studies using mono-temporal NPP data. Spatio-temporal fusion is an effective way of generating high spatio-temporal resolution images from different sensors, thereby providing sufficient data for NPP monitoring in urbanized areas.
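As a rough illustration of the workflow summarized above, the following Python sketch shows how an NDVI value and meteorological inputs might be combined in a generic light-use-efficiency formulation of CASA (NPP = APAR × ε), and how the correlation coefficient (R) and root mean square error used for validation can be computed. It is not the improved CASA model of this study: the linear NDVI-to-FPAR scaling, the function names (`casa_npp`, `pearson_r`, `rmse`), and every default parameter value are assumptions chosen only for illustration.

```python
import math


def casa_npp(ndvi, sol, eps_max, t_stress=1.0, w_stress=1.0,
             ndvi_min=0.05, ndvi_max=0.95, fpar_min=0.001, fpar_max=0.95):
    """Generic CASA-style NPP for one pixel and one time step.

    ndvi     : NDVI of the pixel (e.g. from a fused 30 m image)
    sol      : total solar radiation for the period, MJ/m^2
    eps_max  : maximum light-use efficiency, g C/MJ
    t_stress : temperature stress scalar in [0, 1] (assumed precomputed)
    w_stress : water stress scalar in [0, 1] (assumed precomputed)
    The NDVI and FPAR bounds are illustrative defaults, not values
    taken from the study.
    """
    # Assumed linear scaling of NDVI to FPAR, clamped to the valid range.
    ratio = (ndvi - ndvi_min) / (ndvi_max - ndvi_min)
    ratio = min(max(ratio, 0.0), 1.0)
    fpar = fpar_min + ratio * (fpar_max - fpar_min)

    # About half of incoming solar radiation is photosynthetically active.
    apar = sol * fpar * 0.5

    # Actual light-use efficiency after environmental stress.
    eps = eps_max * t_stress * w_stress
    return apar * eps  # g C/m^2 for the period


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(estimated, measured):
    """Root mean square error between estimated and measured values."""
    n = len(estimated)
    return math.sqrt(sum((e - m) ** 2 for e, m in zip(estimated, measured)) / n)


if __name__ == "__main__":
    # Made-up numbers, used only to show how the functions are called.
    estimated = [casa_npp(v, sol=100.0, eps_max=0.389) for v in (0.3, 0.5, 0.7)]
    measured = [5.0, 9.0, 14.0]
    print(pearson_r(estimated, measured), rmse(estimated, measured))
```

In practice the same calculation would be applied per pixel and per 16-day step of the fused NDVI series and then summed over the year, with the stress scalars derived from the meteorological data.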