As network technology continues to develop, the adoption of intelligent terminals of all kinds has accelerated, driving rapid growth in wireless network traffic. This growth places considerable pressure on resource consumption and on network security maintenance. The objective of this paper is to improve the accuracy of cellular network traffic prediction so as to provide reliable support for downstream tasks such as base station sleep control or the identification of malicious traffic. To this end, a cellular network traffic prediction method based on multi-modal data feature fusion is proposed. First, an attributed K-nearest node (KNN) graph is constructed based on the similarity of data features, and the fused high-dimensional features are incorporated into the graph to provide the model with more information. Subsequently, a dual-branch spatio-temporal graph neural network with an attention mechanism (DBSTGNN-Att) is designed for cellular network traffic prediction. Extensive experiments on real-world datasets demonstrate that the proposed method outperforms baseline models such as temporal graph convolutional networks (T-GCNs) and spatial–temporal self-attention graph convolutional networks (STA-GCNs), reducing the mean absolute error (MAE) by 6.94% and 2.11%, respectively. In addition, ablation experiments show that multi-modal feature fusion using the attributed KNN graph yields an MAE 8.54% lower than that obtained with traditional undirected graphs.
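To make the idea of an attributed KNN graph concrete, the sketch below links each node (for example, a cell or base station) to the K nodes whose fused feature vectors are most similar, and keeps those feature vectors attached to the nodes as attributes. This is a minimal illustration under stated assumptions, not the authors' construction. The abstract does not specify the similarity measure, the value of K, or whether edges are symmetrized. Cosine similarity, directed edges, and the function names `cosine_similarity` and `build_attributed_knn_graph` are all hypothetical choices made for this example.

```python
import math
from typing import List, Tuple

Edge = Tuple[int, int, float]  # (source node, destination node, similarity weight)


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_attributed_knn_graph(
    features: List[List[float]], k: int
) -> Tuple[List[Edge], List[List[float]]]:
    """Hypothetical attributed KNN graph builder.

    Each node i gets directed edges to its k most similar other nodes
    (assumption: cosine similarity, no symmetrization). The fused feature
    vectors are returned unchanged as the node attribute matrix.
    """
    n = len(features)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of nodes")
    edges: List[Edge] = []
    for i in range(n):
        # Similarity from node i to every other node.
        scored = [
            (cosine_similarity(features[i], features[j]), j)
            for j in range(n)
            if j != i
        ]
        # Most similar first; ties broken by the smaller node index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        for sim, j in scored[:k]:
            edges.append((i, j, sim))
    # Copy the features so callers can modify attributes without side effects.
    attributes = [list(f) for f in features]
    return edges, attributes


if __name__ == "__main__":
    # Four toy nodes with 2-dimensional "fused" features (made-up values).
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    graph_edges, node_attrs = build_attributed_knn_graph(feats, k=1)
    for src, dst, w in graph_edges:
        print(f"{src} -> {dst} (similarity {w:.3f})")
```

With `k=1`, nodes 0 and 1 link to each other, as do nodes 2 and 3, because their feature vectors point in nearly the same direction. Because the KNN relation is not symmetric in general, the edges are kept directed here; a real pipeline might instead symmetrize them or learn edge weights, which this sketch does not attempt.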