Objective: This study aims to address the limitations of current clinical methods for predicting delivery mode by constructing a multimodal neural network model. The model uses data from a digital twin-empowered labor monitoring system: computerized cardiotocography (cCTG), ultrasound (US) examination data, and electronic health records (EHRs) of pregnant women.

Methods: The model integrates these three data modalities from 105 pregnant women (76 vaginal deliveries and 29 cesarean deliveries) at the Department of Obstetrics and Gynecology of The First Affiliated Hospital of Jinan University, Guangzhou, China. It employs a hybrid architecture combining a convolutional neural network (CNN) and a bi-directional long short-term memory (BiLSTM) network to compress the data into a single feature vector for each patient.

Results: The model achieves a cross-validation accuracy of 93.33%, an F1-score of 86.26%, an area under the receiver operating characteristic curve (AUC) of 97.10%, and a Brier score of 6.67%. Importantly, while cCTG and EHRs are crucial for labor management, integrating US imaging data significantly enhances prediction accuracy.

Conclusion: The findings suggest that the multimodal model is a promising tool for predicting delivery mode and provides a comprehensive approach to intrapartum maternal and fetal health monitoring. Integrating multi-source data, including real-time information, holds potential for further improving the algorithm's predictive accuracy as the volume of analyzed data increases. This approach could be highly beneficial for dynamically fusing data from different sources throughout the maternal and fetal health lifecycle, from pregnancy to delivery.
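
For readers less familiar with the four evaluation metrics named in the Results (accuracy, F1-score, area under the ROC curve, and Brier score), the following is a minimal sketch of how each can be computed for a binary classifier. It is not the study's evaluation code. The function names, the toy labels and probabilities, the 0.5 decision threshold, and the choice of which delivery mode is coded as the positive class (1) are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = positive class, 0 = negative class.
y_true = [1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.3, 0.1, 0.6]
# Assumed decision threshold of 0.5 to turn probabilities into labels.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"F1       = {f1_score(y_true, y_pred):.4f}")
print(f"AUC      = {roc_auc(y_true, y_prob):.4f}")
print(f"Brier    = {brier_score(y_true, y_prob):.4f}")
```

Accuracy and F1 depend on the thresholded labels, whereas AUC and the Brier score are computed from the predicted probabilities directly. With an imbalanced cohort, F1 and AUC are generally more informative than accuracy alone.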