Agricultural applications of remote sensing data typically require high spatial resolution and frequent observations. The increasing availability of high spatial resolution imagery meets the spatial resolution requirement well. However, the long revisit period and frequent cloud contamination of such imagery severely compromise its ability to monitor crop growth, which is characterized by high temporal heterogeneity. Many spatiotemporal fusion methods have been developed to produce synthetic images with high spatial and temporal resolutions. However, these existing methods have been developed and validated mainly for fusing low and medium spatial resolution satellite data. Their applicability to fusing medium and high spatial resolution images remains unknown and may face various challenges. To address this issue, we propose a novel spatiotemporal fusion method, the dual-stream spatiotemporal decoupling fusion architecture model (StarFusion), to predict high spatial resolution images. Compared with other fusion methods, the model has 2 distinct advantages: (a) it maintains high fusion accuracy and good spatial detail by combining a deep-learning-based super-resolution method and a partial least squares regression model through an edge- and color-based weighting loss function; and (b) it demonstrates improved transferability over time by introducing image gradient maps and the partial least squares regression model. We tested the StarFusion model at 3 experimental sites and compared it with 4 traditional methods, namely STARFM (spatial and temporal adaptive reflectance fusion model), FSDAF (flexible spatiotemporal data fusion), Fit-FC (regression model fitting, spatial filtering, and residual compensation), and FIRST (fusion incorporating spectral autocorrelation), as well as a deep-learning-based method, the super-resolution generative adversarial network.
In addition, we investigated whether our method can use multiple pairs of coarse and fine images in the training process. The results show that training with multiple image pairs provides better overall performance, but both the single-pair and multiple-pair configurations outperform the comparison methods. Considering the difficulty of obtaining multiple cloud-free image pairs in practice, and since the performance degradation with a single pair is not significant, our method is recommended for providing high-quality Gaofen-1 data with improved temporal resolution in most cases.