Driver intention prediction allows drivers to perceive potential dangers as early as possible and has become one of the most important research topics in the field of autonomous driving in recent years. In this study, we propose a driver intention prediction method based on multi-dimensional cross-modality information interaction. First, an efficient video recognition network is designed to extract channel-temporal features of the inside (driver) and outside (road) videos respectively. Within this network, we design a cross-modality channel-spatial weight mechanism to achieve information interaction between the two feature extraction networks corresponding to the two modalities, and we introduce a contrastive learning module that forces the two feature extraction networks to enhance structural knowledge interaction. Then, the obtained representations of the inside and outside videos are fused by a Res-Layer based module to produce a preliminary prediction, which is subsequently corrected by incorporating GPS information to yield the final decision. In addition, we employ a multi-task framework to train the entire network. We validate the proposed method on the public Brain4Cars dataset, and the results show that it achieves competitive accuracy while balancing performance and computational cost.
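To make the cross-modality channel-spatial weighting idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation): each modality produces a channel weight vector (global-pooled, sigmoid-gated) and a spatial weight map, and the weights derived from one modality reweight the features of the other. All function names and the weighting details here are assumptions for illustration; the paper's mechanism may differ.

```python
import numpy as np

def sigmoid(x):
    # Numerically standard logistic gate in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def modality_weights(feat):
    # feat: (C, H, W) feature map of one modality
    channel_w = sigmoid(feat.mean(axis=(1, 2)))  # (C,): global average pooling per channel
    spatial_w = sigmoid(feat.mean(axis=0))       # (H, W): average across channels
    return channel_w, spatial_w

def cross_interact(driver_feat, road_feat):
    # Weights computed from one modality modulate the OTHER modality's
    # features, so the two branches exchange channel/spatial attention.
    cw_d, sw_d = modality_weights(driver_feat)
    cw_r, sw_r = modality_weights(road_feat)
    driver_out = driver_feat * cw_r[:, None, None] * sw_r[None, :, :]
    road_out = road_feat * cw_d[:, None, None] * sw_d[None, :, :]
    return driver_out, road_out
```

Because each weight lies in (0, 1), the interaction acts as a soft gate: the road branch can suppress or emphasize driver-branch channels and spatial locations, and vice versa, without changing feature-map shapes.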