Stock price prediction is a classic and challenging task with the potential to help traders make more profitable decisions. Deep-learning-based prediction methods have improved markedly in recent years, yet most existing approaches rely solely on historical price data and therefore cannot capture market dynamics beyond price indicators, which limits their performance. To address this, we propose the Deep Cross-Modal Information Fusion Network (DCIFNet), a novel stock price prediction method that combines social media text with historical price information. DCIFNet first encodes stock prices and Twitter text with temporal convolutions, ensuring that each element carries sufficient information about its surrounding context. The resulting representations are then fed into a transformer-based cross-modal fusion structure that integrates the salient information from both modalities. Finally, a multi-graph convolution attention network models the relationships among stocks from multiple perspectives, capturing industry affiliations, Wikipedia references, and other relations between linked stocks more effectively, and thereby improving prediction accuracy. Trend prediction and simulated trading experiments on high-frequency trading datasets spanning nine industries, comparative evaluations against the Multi-Attention Network for Stock Prediction (MANGSF), and ablation studies confirm the effectiveness of DCIFNet, which achieves an accuracy of 0.6309, a marked improvement over representative methods in the field.
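The three-stage pipeline described above (temporal-convolution encoders, transformer-style cross-modal fusion, multi-graph attention over stock relations) can be sketched in miniature as follows. This is a hedged illustration, not the authors' implementation: all dimensions, weight names (`wp`, `ws`, `wr`, `wo`), the causal padding choice, and the single-head attention are assumptions made for a compact NumPy demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_price, d_text, d, n_stocks = 8, 4, 16, 8, 5  # assumed toy sizes

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_conv(x, w):
    """Same-length causal 1-D convolution: each step mixes in its neighbours."""
    k = w.shape[0]
    xp = np.pad(x, ((k - 1, 0), (0, 0)))  # left-pad so output length equals T
    return np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])

def cross_attention(q_src, kv_src):
    """Transformer-style fusion: one modality queries the other."""
    scores = (q_src @ kv_src.T) / np.sqrt(q_src.shape[-1])
    return softmax(scores) @ kv_src

def multi_graph_attention(h, adjs, w_rel):
    """Propagate over several relation graphs, then attend over relations."""
    msgs = np.stack([(a / a.sum(1, keepdims=True)) @ h for a in adjs])  # (R, n, d)
    gate = softmax(msgs @ w_rel, axis=0)           # per-stock weight for each relation
    return (gate[..., None] * msgs).sum(axis=0)    # (n, d)

def dcifnet_step(prices, tweets, adjs, params):
    per_stock = []
    for p, s in zip(prices, tweets):
        hp = temporal_conv(p, params["wp"])        # encode the price window
        hs = temporal_conv(s, params["ws"])        # encode the tweet window
        fused = cross_attention(hp, hs)            # price queries attend to text
        per_stock.append(fused[-1])                # last step summarises the window
    h = np.stack(per_stock)                        # (n_stocks, d)
    h = multi_graph_attention(h, adjs, params["wr"])
    return softmax(h @ params["wo"])               # up/down movement probabilities

params = {
    "wp": rng.normal(size=(3, d_price, d)) * 0.1,
    "ws": rng.normal(size=(3, d_text, d)) * 0.1,
    "wr": rng.normal(size=d) * 0.1,
    "wo": rng.normal(size=(d, 2)) * 0.1,
}
prices = rng.normal(size=(n_stocks, T, d_price))   # per-stock price features
tweets = rng.normal(size=(n_stocks, T, d_text))    # per-stock tweet embeddings
# three assumed relation graphs, e.g. industry, Wikipedia, correlation links
adjs = [rng.random((n_stocks, n_stocks)) + np.eye(n_stocks) for _ in range(3)]

probs = dcifnet_step(prices, tweets, adjs, params)
print(probs.shape)  # one (up, down) distribution per stock
```

The design mirrors the abstract's ordering: intra-modal temporal encoding first, cross-modal fusion second, and inter-stock relational reasoning last, so relational smoothing operates on already-fused representations.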