To address air pollution effectively and improve air quality, governments must be able to predict the air quality index accurately and reliably. However, air quality prediction is subject to uncertainty and instability because of the atmosphere’s fluidity, which makes it difficult for a single model to identify both the temporal and the spatial correlations. Therefore, a new hybrid model, INNGNN, is proposed. It combines an interpretable neural network (INN) with a graph neural network (GNN) to model the temporal and spatial dependence of air quality and to achieve accurate multi-step air quality prediction. First, the INN interprets the time series to extract potentially important features that are easily overlooked in the data. Second, a self-attention mechanism captures the local and global dependencies in the time series. Finally, the GNN operates on a graph of cities to determine the relationships between them and to extract the spatially dependent features. In the experimental evaluation, the INNGNN model performs better than comparable algorithms, which indicates that it can effectively capture the temporal and spatial relationships and predict air quality more accurately.
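The temporal and spatial stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the four-city chain graph, the layer sizes, the mean pooling over time, and the random (untrained) weights are all assumptions chosen for illustration, and the INN stage is left out because the abstract does not specify its form. The sketch applies scaled dot-product self-attention along the time axis of each city's series, then one graph-convolution step over a symmetrically normalized adjacency matrix, and finally a linear map to a multi-step forecast.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over time.

    x has shape (cities, time, features); every time step attends to
    every other time step of the same city.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(h, adj, w):
    """One graph-convolution layer: ReLU(A_norm @ H @ W)."""
    return np.maximum(normalized_adjacency(adj) @ h @ w, 0.0)

# Hypothetical setup: 4 cities, 24 past time steps, 3 pollutant features,
# and a 6-step forecast horizon. All values are synthetic.
rng = np.random.default_rng(0)
n_cities, n_steps, n_feat, d_model, horizon = 4, 24, 3, 8, 6
x = rng.normal(size=(n_cities, n_steps, n_feat))

# Assumed city graph: a simple chain 0-1-2-3 (1 = neighbouring cities).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

# Temporal stage: attention over each city's own time series.
w_q, w_k, w_v = (rng.normal(size=(n_feat, d_model)) for _ in range(3))
temporal = self_attention(x, w_q, w_k, w_v)      # (cities, time, d_model)

# Pool over time to get one embedding per city (an assumed simplification).
city_embed = temporal.mean(axis=1)               # (cities, d_model)

# Spatial stage: mix each city's embedding with its neighbours'.
spatial = graph_conv(city_embed, adj, rng.normal(size=(d_model, d_model)))

# Multi-step output: one value per city per future step.
forecast = spatial @ rng.normal(size=(d_model, horizon))
print(forecast.shape)
```

In a trained model the weight matrices would be learned from historical air quality data, and the adjacency matrix could instead be derived from distances or correlations between cities; both choices are outside what the abstract states.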