Educational exercises are crucial to the successful implementation of online education, as they play a key role in assessing student learning and supporting teachers' instruction. These exercises encompass two primary types of data: text and images. However, existing methods for extracting exercise features focus only on the textual data, neglecting the rich semantic information present in the images. Consequently, the representation vectors generated by these methods struggle to fully characterize an exercise. To address these limitations, this paper proposes a multimodal information fusion-based exercise characterization model called MIFM. MIFM tackles the shortcomings of current exercise modeling methods by extracting and fusing the heterogeneous features present in exercises: it employs a dual-stream architecture to extract image and text features separately, establishes connections between the heterogeneous data with cross-modality attention, and finally fuses the heterogeneous features using a Bi-LSTM combined with a multi-head attention mechanism. The resulting model produces a multimodal exercise characterization vector that fuses both modalities and incorporates three knowledge elements. In the experiments, the model replaces the exercise characterization modules of three educational tasks, achieving an ACC of 72.35% on the knowledge mapping task, a PCC of 46.83% on the exercise difficulty prediction task, and an AUC of 62.57% on the student performance prediction task.
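To make the dual-stream design concrete, the following is a minimal PyTorch sketch of this kind of pipeline: separate projections for the text and image streams, cross-modality attention from text tokens to image regions, and a Bi-LSTM followed by multi-head attention for fusion. All module names, dimensions, the residual connection, and the mean-pooling step are illustrative assumptions, not the paper's exact MIFM implementation.

    # Illustrative sketch only; hyperparameters and pooling are assumptions,
    # not the authors' exact MIFM architecture.
    import torch
    import torch.nn as nn

    class DualStreamFusion(nn.Module):
        def __init__(self, text_dim=768, img_dim=512, hidden=256, heads=4):
            super().__init__()
            # Project both modalities into a shared hidden space.
            self.text_proj = nn.Linear(text_dim, hidden)
            self.img_proj = nn.Linear(img_dim, hidden)
            # Cross-modality attention: text tokens attend to image regions.
            self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
            # Bi-LSTM over the fused token sequence (hidden//2 per direction).
            self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True,
                                  bidirectional=True)
            # Multi-head self-attention weights the Bi-LSTM outputs.
            self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

        def forward(self, text_feats, img_feats):
            # text_feats: (B, T, text_dim) token embeddings from a text encoder
            # img_feats:  (B, R, img_dim) region features from an image encoder
            t = self.text_proj(text_feats)
            v = self.img_proj(img_feats)
            # Text queries attend over image regions, injecting visual semantics.
            fused, _ = self.cross_attn(query=t, key=v, value=v)
            fused = fused + t  # residual connection (an assumed design choice)
            seq, _ = self.bilstm(fused)
            attended, _ = self.self_attn(seq, seq, seq)
            # Mean-pool into a single exercise characterization vector.
            return attended.mean(dim=1)

    # Usage: fuse 20 text tokens with 10 image regions into one vector.
    model = DualStreamFusion()
    vec = model(torch.randn(2, 20, 768), torch.randn(2, 10, 512))
    print(vec.shape)  # torch.Size([2, 256])

The resulting pooled vector would then stand in for the exercise characterization module of a downstream task such as knowledge mapping, difficulty prediction, or student performance prediction.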