The coarse spatial resolution of passive microwave surface soil moisture (SSM) products makes them unsuitable for regional agricultural and hydrological applications such as drought monitoring and irrigation management. Downscaling passive microwave SSM with optical/thermal infrared (OTI) data can effectively refine its spatial resolution for regional applications. However, the ability of this approach to estimate long SSM time series is limited by the OTI data, which are heavily contaminated by clouds. To reduce the dependence on OTI data, an SSM retrieval and spatio-temporal fusion model (SMRFM) is proposed in this study. Specifically, a model coupling in situ data, Moderate Resolution Imaging Spectroradiometer (MODIS) OTI data, and topographic information is fitted by the least squares method to retrieve MODIS SSM at 1 km. The retrieved MODIS SSM and a spatio-temporal fusion model are then used to downscale the passive microwave SSM from the coarse scale to 1 km. The proposed SMRFM is applied to a grassland-dominated area over Naqu, central Tibetan Plateau, to downscale SSM from the Advanced Microwave Scanning Radiometer for the Earth Observing System (AMSR-E) during the unfrozen period. In situ SSM and Noah land surface model SSM at 0.01° are used to validate the long time series of estimated MODIS SSM. The evaluations show that the estimated MODIS SSM has the same temporal resolution as AMSR-E while providing substantially more spatial detail. Moreover, the temporal accuracy of the estimated MODIS SSM against in situ data (r = 0.673, μbRMSE = 0.070 m³/m³) is better than that of AMSR-E SSM (r = 0.661, μbRMSE = 0.111 m³/m³). In addition, the temporal r of the estimated MODIS SSM is clearly higher than that of the Noah data. These results suggest that the SMRFM can be used to estimate long time series of MODIS SSM by downscaling AMSR-E SSM in the study area.
Overall, this study can support the development and application of microwave SSM research at the regional scale.
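The temporal accuracy figures quoted above (r and μbRMSE) can be illustrated with a minimal sketch. It assumes that μbRMSE denotes the unbiased root mean square error, i.e. the RMSE computed after removing the mean of each series, and that r is the Pearson correlation coefficient; the abstract does not spell out either definition. The function name `temporal_metrics` and the sample values are hypothetical and are not taken from the study.

```python
import math


def temporal_metrics(estimated, observed):
    """Return (r, ubRMSE) for two paired SSM time series in m3/m3.

    Assumption: pairs where either value is None or NaN (for example,
    cloud-affected or missing days) are skipped before computing metrics.
    """
    pairs = [
        (e, o)
        for e, o in zip(estimated, observed)
        if e is not None and o is not None
        and not math.isnan(e) and not math.isnan(o)
    ]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid pairs")

    mean_e = sum(e for e, _ in pairs) / n
    mean_o = sum(o for _, o in pairs) / n

    # Anomalies: each series with its own temporal mean removed.
    anom_e = [e - mean_e for e, _ in pairs]
    anom_o = [o - mean_o for _, o in pairs]

    var_e = sum(a * a for a in anom_e)
    var_o = sum(b * b for b in anom_o)
    if var_e == 0.0 or var_o == 0.0:
        raise ValueError("correlation is undefined for a constant series")

    # Pearson correlation coefficient.
    r = sum(a * b for a, b in zip(anom_e, anom_o)) / math.sqrt(var_e * var_o)

    # Unbiased RMSE: RMSE of the anomaly differences, so a constant
    # offset between the two series does not contribute to the error.
    ubrmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(anom_e, anom_o)) / n)
    return r, ubrmse


if __name__ == "__main__":
    # Hypothetical daily values, for illustration only.
    in_situ = [0.18, 0.22, 0.25, float("nan"), 0.21, 0.19]
    downscaled = [0.20, 0.25, 0.27, 0.24, 0.22, 0.23]
    r, ubrmse = temporal_metrics(downscaled, in_situ)
    print(f"r = {r:.3f}, ubRMSE = {ubrmse:.3f} m3/m3")
```

Because the mean of each series is removed first, a product with a constant wet or dry bias relative to the stations can still score a low unbiased RMSE; under this assumed definition the metric reflects how well the temporal dynamics are captured rather than the absolute level.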