The accuracy of urban traffic flow prediction is influenced by the surrounding regional road network, historical traffic flow, and seasonal climate, which together give rise to complex spatial and temporal dependencies. To account for these factors, we propose a multi-modal traffic flow prediction model that fuses road network, historical traffic flow, and weather data. First, a weighted spatio-temporal graph is constructed from the traffic flow time series data, and a weighted STSGCN model is used to extract spatio-temporal graph features. Second, an image sequence is constructed from road network, vehicle trajectory, and sensor position data, and visual features are extracted with ResNet. Third, a two-channel multi-modal fusion model based on MCB and attention fuses the spatio-temporal graph features with the visual features of the image sequence to obtain an aligned fusion vector. Finally, the aligned fusion vector is combined with the weather feature vector to produce the traffic flow prediction. Experimental results show that the proposed model outperforms the baseline models, and ablation results confirm the effectiveness of each module in the proposed model.
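To illustrate the kind of operation the MCB fusion channel relies on, the following is a minimal sketch of compact bilinear pooling in its general form: each feature vector is projected with a count sketch, and the two sketches are combined by circular convolution, which approximates their outer product in a fixed, small dimension. It is not the implementation used in this work. The function names, the output dimension `d`, the seeds, and the toy feature vectors are all assumptions made for illustration, and the attention channel is omitted.

```python
import random


def make_sketch_params(n, d, seed):
    """Draw a random bucket in [0, d) and a random sign for each of n inputs."""
    rng = random.Random(seed)
    h = [rng.randrange(d) for _ in range(n)]
    s = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return h, s


def count_sketch(x, h, s, d):
    """Project x to d dimensions: add s[i] * x[i] into bucket h[i]."""
    out = [0.0] * d
    for xi, hi, si in zip(x, h, s):
        out[hi] += si * xi
    return out


def circular_convolve(a, b):
    """Circular convolution of two equal-length vectors (direct O(d^2) form)."""
    d = len(a)
    out = [0.0] * d
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % d] += ai * bj
    return out


def mcb_fuse(graph_feat, visual_feat, d=16, seed=0):
    """Fuse two feature vectors into one d-dimensional vector (illustrative only)."""
    h1, s1 = make_sketch_params(len(graph_feat), d, seed)
    h2, s2 = make_sketch_params(len(visual_feat), d, seed + 1)
    sketch_g = count_sketch(graph_feat, h1, s1, d)
    sketch_v = count_sketch(visual_feat, h2, s2, d)
    return circular_convolve(sketch_g, sketch_v)


# Toy stand-ins for a spatio-temporal graph feature and a visual feature.
graph_feat = [0.5, -1.0, 2.0, 0.0]
visual_feat = [1.0, 0.25, -0.5]
fused = mcb_fuse(graph_feat, visual_feat, d=8)
print(len(fused))
```

The convolution is written out directly here for readability. Practical implementations usually compute it as an element-wise product in the frequency domain using an FFT, which gives the same result at lower cost for large `d`.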