Weather forecasting is a classical problem in remote sensing, and precipitation is particularly difficult to predict accurately because of its complex physical motion. Precipitation significantly affects human life, work, and the ecological environment, so precise precipitation forecasting is vital for water resource management, ecological protection, and disaster mitigation. This study introduces DFMM-Precip, a deep learning-based precipitation-forecasting method that integrates reanalysis precipitation data and satellite data through a multi-modal fusion layer and predicts future precipitation details through a global–local joint temporal-spatial attention mechanism. By effectively combining satellite infrared data with reanalysis data, the approach enhances the accuracy of precipitation forecasting. Experimental results for 24 h precipitation forecasts show that DFMM-Precip’s multi-modal fusion layer successfully integrates multi-modal precipitation-related data, leading to improved forecast accuracy. In particular, the global–local joint temporal-spatial attention mechanism provides precise, detailed forecasts of spatial and temporal precipitation patterns, outperforming other state-of-the-art models. The MSE of the forecasts is lower by a factor of 10 than that of the advanced RNN model and by a factor of 2.4 than that of the advanced CNN model with single-modal data input. The probability of successful rainfall prediction is improved by more than 10%.
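
For readers unfamiliar with the two reported metrics, the following is a minimal illustrative sketch of how they could be computed on gridded precipitation fields. It is not the authors' evaluation code. It assumes that "probability of successful rainfall prediction" corresponds to the standard probability of detection (POD), and the rain/no-rain threshold of 0.1 (in the same units as the data) is a hypothetical choice; the function names are likewise illustrative.

```python
import numpy as np


def mse(forecast, observed):
    """Mean squared error between forecast and observed precipitation grids."""
    forecast = np.asarray(forecast, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.mean((forecast - observed) ** 2))


def probability_of_detection(forecast, observed, threshold=0.1):
    """POD = hits / (hits + misses) for rain events at or above `threshold`.

    The threshold value is an assumption made for illustration only.
    """
    forecast_rain = np.asarray(forecast, dtype=float) >= threshold
    observed_rain = np.asarray(observed, dtype=float) >= threshold
    hits = int(np.sum(forecast_rain & observed_rain))
    misses = int(np.sum(~forecast_rain & observed_rain))
    if hits + misses == 0:
        # No observed rain events, so POD is undefined.
        return float("nan")
    return hits / (hits + misses)


# Toy 2x2 grids (made-up values, not data from the study).
observed = [[0.0, 1.0], [2.0, 0.0]]
forecast = [[0.0, 1.5], [0.0, 0.5]]

toy_mse = mse(forecast, observed)
toy_pod = probability_of_detection(forecast, observed)
```

On this toy grid one of the two observed rain cells is forecast correctly, so the POD is 0.5, while the MSE averages the squared per-cell errors over all four cells.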