The acceleration of industrialization and urbanization has brought about serious air pollution problems, which threaten human health and lives, environmental safety, and sustainable social development. Air quality prediction is an effective approach for providing early warning of air pollution and supporting cleaner industrial production. However, existing approaches have a weak ability to capture long-term dependencies and complex relationships in PM2.5 time series data. To address this problem, this paper proposes a new deep learning model, called temporal difference-based graph transformer networks (TDGTN), to learn long-term temporal dependencies and complex relationships from PM2.5 time series data for air quality prediction. The proposed TDGTN comprises encoder and decoder layers built on a newly developed graph attention mechanism. In particular, considering the similarity between different time moments and the importance of the temporal difference between two adjacent moments for air quality prediction, we first construct graph-structured data from the original PM2.5 time series data at different moments, which have no explicit graph structure. Then, based on the constructed graph, we augment the self-attention mechanism with temporal difference information to develop a new graph attention mechanism. Finally, the developed graph attention mechanism is embedded into the encoder and decoder layers of the proposed TDGTN to learn long-term temporal dependencies and complex relationships from a graph perspective on PM2.5 air quality prediction tasks. To verify the effectiveness of the proposed method, we conduct air quality prediction experiments on two real-world datasets from China: the Beijing PM2.5 dataset, ranging from 01/01/2010 to 12/31/2014, and the Taizhou PM2.5 dataset, ranging from 01/01/2017 to 12/31/2019.
Experimental results indicate that the proposed method achieves more accurate predictions than other air quality forecasting methods, including autoregressive moving average (ARMA), support vector regression (SVR), convolutional neural network (CNN), long short-term memory (LSTM), and the original Transformer, on both short-term (1 hour) and long-term (6, 12, 24, and 48 hours) air quality prediction tasks.
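As a rough illustration of the idea summarized above, the following NumPy sketch builds a graph over time moments from their similarity and then applies a masked attention step in which the keys are shifted by the temporal difference between adjacent moments. It is not the TDGTN implementation. The cosine-similarity k-nearest-neighbour graph, the way the difference enters the keys, and the names `temporal_difference`, `similarity_graph`, `graph_attention`, and `k` are all assumptions made for illustration only.

```python
import numpy as np


def temporal_difference(x):
    """First-order difference between adjacent moments (first row set to zero)."""
    x = np.asarray(x, dtype=float)
    d = np.zeros_like(x)
    d[1:] = x[1:] - x[:-1]
    return d


def similarity_graph(x, k=3):
    """Hypothetical graph construction: link each moment to its k most
    similar moments by cosine similarity, plus a self-loop. Requires k < T."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norm, 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)           # exclude self from the top-k search
    idx = np.argsort(-sim, axis=1)[:, :k]    # indices of the k most similar moments
    adj = np.zeros(sim.shape, dtype=bool)
    adj[np.arange(x.shape[0])[:, None], idx] = True
    np.fill_diagonal(adj, True)              # self-loops keep every row non-empty
    return adj


def graph_attention(x, adj):
    """Scaled dot-product attention restricted to graph edges. The keys are
    shifted by the temporal difference (an assumed, simplified design)."""
    x = np.asarray(x, dtype=float)
    keys = x + temporal_difference(x)
    scores = x @ keys.T / np.sqrt(x.shape[1])
    scores = np.where(adj, scores, -np.inf)  # mask out non-edges
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights
```

Here `x` is a `(T, F)` array with one row per time moment. A real model would add learned query, key, and value projections, multiple heads, and the encoder and decoder stack described in the abstract.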