Introduction
The reliability of data‐driven predictions in real‐world scenarios remains uncertain. This study aimed to develop and validate a machine‐learning‐based model for predicting clinical outcomes using real‐world data from an electronic clinical pathway (ePath) system.

Methods
All available data were collected from patients with lung cancer who underwent video‐assisted thoracoscopic surgery at two independent hospitals utilizing the ePath system. The primary clinical outcome of interest was prolonged air leak (PAL), defined as drainage removal more than 2 days post‐surgery. Data‐driven prediction models were developed in a cohort of 314 patients from a university hospital applying sparse linear regression models (least absolute shrinkage and selection operator, ridge, and elastic net) and decision tree ensemble models (random forest and extreme gradient boosting). Model performance was then validated in a cohort of 154 patients from a tertiary hospital using the area under the receiver operating characteristic curve (AUROC) and calibration plots.

Results
To mitigate bias, variables with missing data related to PAL or those with high rates of missing data were excluded from the dataset. Fivefold cross‐validation indicated improved AUROCs when utilizing key variables, even post‐imputation of missing data. Dichotomizing continuous variables enhanced performance, particularly when fewer variables were employed in the decision tree ensemble models. Consequently, regression models incorporating seven key variables in complete case analysis demonstrated superior discriminatory ability for both internal (AUROCs: 0.77–0.84) and external cohorts (AUROCs: 0.75–0.84).
These models exhibited satisfactory calibration in both cohorts.

Conclusions
The data‐driven prediction model built on the ePath system exhibited adequate performance in predicting PAL after video‐assisted thoracoscopic surgery in a real‐world setting, provided that variables were optimized and population characteristics were considered.
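
As an illustration of the discrimination metric reported here, the following minimal sketch computes an AUROC as the probability that a randomly chosen patient with the outcome receives a higher predicted risk than a randomly chosen patient without it, with ties counted as one half. The function name `auroc` and the toy labels and scores are hypothetical and are not taken from the study; the software the authors actually used is not stated in this abstract.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: iterable of 0/1 outcomes (1 = event, e.g. PAL present).
    scores: iterable of predicted risks, same length as labels.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half a concordant pair
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only (not study data).
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]
toy_auroc = auroc(toy_labels, toy_scores)
```

This quadratic pairwise form is chosen for readability; for cohorts of a few hundred patients it is fast enough, while a rank-based formulation would be preferable for much larger datasets.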