Purpose: Predicting primary treatment failure (PTF) is necessary for patients with diffuse large B-cell lymphoma (DLBCL), since it is an important means of improving front-line outcomes. Using interim 18F-fluorodeoxyglucose (18F-FDG) positron emission tomography/computed tomography (PET/CT) image data, we aimed to construct multimodal deep learning (MDL) models to predict PTF in low-risk DLBCL, which could enable individualized treatment decision-making in clinical practice.
Methods: From June 2016 to November 2020, 205 DLBCL patients who underwent interim 18F-FDG PET/CT scans and received front-line standard-of-care treatment were enrolled. An additional 44 patients were collected for external validation. We built a backbone by redesigning the network architecture and learning strategy of Conv-LSTM, a well-known visual recognition network. On top of this improved backbone, we developed and compared multiple MDL models using different feature fusion strategies: a pixel intermixing model, a separate channel model, a separate branch model, a quantitative weighting model, and a hybrid learning model. We further proposed adding a contrastive training objective to the best of these models to strengthen the cross-modal correlation of semantic embeddings and thereby improve prediction performance. For visualization, the region of interest was indicated using an activation map.
Results: The MDL model using the hybrid learning strategy performed best in predicting PTF, with an accuracy of 89.76% (95% confidence interval [CI]: 84.85%-93.20%) in the test cohort. After further optimization with contrastive objective training, the accuracy improved to 91.22% (95% CI: 86.55%-94.37%). The contrastive hybrid learning model achieved AUCs of 0.926 and 0.925 in the test cohort and the external validation cohort, respectively.
Conclusion: Our model showed outstanding performance in predicting PTF in low-risk DLBCL and holds promise for improving individualized clinical treatment strategies.
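
As a worked check on the accuracies and confidence intervals reported in the Results, the following minimal sketch computes a Wilson score interval for a binomial proportion. Two things here are assumptions and are not stated in the abstract: that the intervals were computed with the Wilson method, and that the accuracies correspond to 184 and 187 correct predictions out of 205 patients. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.959964 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: the counts below are inferred from the reported percentages
# (184/205 = 89.76%, 187/205 = 91.22%); the abstract does not state them.
n_patients = 205
for label, correct in [("hybrid learning", 184),
                       ("contrastive hybrid learning", 187)]:
    low, high = wilson_interval(correct, n_patients)
    print(f"{label}: accuracy {correct / n_patients:.2%}, "
          f"95% CI {low:.2%}-{high:.2%}")
```

Under these assumptions the sketch yields intervals of 84.85%-93.20% and 86.55%-94.37%, which agree with the reported values; this agreement suggests, but does not confirm, that a Wilson-type interval was used.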