Satellite time-series data carry information in three dimensions (spatial, spectral, and temporal) and are widely used for monitoring, simulating, and evaluating Earth activities. However, some time-phase images in a satellite time series are missing because of satellite sensor malfunctions or adverse atmospheric conditions, which prevents effective use of the data. These gaps therefore need to be filled by sequence image interpolation. The linear interpolation and deep learning methods previously applied to sequence image interpolation produce large errors between the interpolated results and the real images, because they neither estimate pixel positions accurately nor capture changes in objects. Inspired by video frame interpolation, we combine optical flow estimation with deep learning and propose a method named Multi-Scale Optical Flow-Intermediate Feature Joint Network. The method learns pixel occlusion and detail compensation information for each channel, and jointly refines optical flow and intermediate features at different scales within a single end-to-end network. In addition, we introduce a spectral loss function to improve the network's learning of the spectral features of satellite images. We built a time-series dataset from Landsat-8 and Sentinel-2 satellite data and conducted experiments on it. Visual and quantitative evaluation of the results shows that the images interpolated by our method are more spectrally and spatially consistent with the real images, and that on the test dataset our method achieves a 7.54% lower Root Mean Square Error than other approaches.
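
To make the linear interpolation baseline and the Root Mean Square Error metric concrete, the following is a minimal sketch and not the implementation used in this work. It assumes co-registered images flattened into plain lists of pixel values, and the function names `linear_interpolate`, `rmse`, and `relative_reduction` are hypothetical. It also assumes that a "lower RMSE" percentage is read as a relative reduction against a baseline, which the text does not state explicitly.

```python
import math


def linear_interpolate(prev_img, next_img, t=0.5):
    """Per-pixel temporal linear interpolation between two images.

    Images are flat lists of pixel values (an illustrative layout only).
    t is the normalized time of the missing image, 0 <= t <= 1.
    """
    if len(prev_img) != len(next_img):
        raise ValueError("images must have the same number of pixels")
    return [(1.0 - t) * a + t * b for a, b in zip(prev_img, next_img)]


def rmse(pred, truth):
    """Root Mean Square Error between a predicted and a real image."""
    if len(pred) != len(truth):
        raise ValueError("images must have the same number of pixels")
    squared = sum((p - q) ** 2 for p, q in zip(pred, truth))
    return math.sqrt(squared / len(pred))


def relative_reduction(baseline_rmse, method_rmse):
    """Percentage by which method_rmse is lower than baseline_rmse."""
    return 100.0 * (baseline_rmse - method_rmse) / baseline_rmse


# Toy example: one pixel changes non-linearly over time, so the
# position-agnostic linear baseline misses it.
prev_img = [0.10, 0.20, 0.30, 0.40]
next_img = [0.30, 0.40, 0.50, 0.60]
truth = [0.20, 0.30, 0.50, 0.50]

baseline = linear_interpolate(prev_img, next_img)
baseline_rmse = rmse(baseline, truth)
print(f"baseline RMSE: {baseline_rmse:.4f}")
```

Because the baseline blends pixel values at fixed positions, it cannot follow objects that move or change between acquisitions, which is the limitation that optical flow estimation is meant to address.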