The profound impacts of severe air pollution on human health, ecological balance, and economic stability are undeniable. Precise air quality forecasting stands as a crucial necessity, enabling governmental bodies and vulnerable communities to proactively take essential measures to reduce exposure to detrimental pollutants. Previous research has primarily focused on predicting air quality using only time-series data. However, the importance of remote-sensing image data has received limited attention. This paper proposes a new multi-modal deep-learning model, Res-GCN, which integrates high spatial resolution remote-sensing images and time-series air quality data from multiple stations to forecast future air quality. Res-GCN employs two deep-learning networks, one utilizing the residual network to extract hidden visual information from remote-sensing images, and another using a dynamic spatio-temporal graph convolution network to capture spatio-temporal information from time-series data. By extracting features from two different modalities, improved predictive performance can be achieved. To demonstrate the effectiveness of the proposed model, experiments were conducted on two real-world datasets. The results show that the Res-GCN model effectively extracts multi-modal features, significantly enhancing the accuracy of multi-step predictions. Compared to the best-performing baseline model, the multi-step prediction’s mean absolute error, root mean square error, and mean absolute percentage error increased by approximately 6%, 7%, and 7%, respectively.