Background: Synthetic patient data (SPD) generation for survival analysis in oncology trials holds significant potential for accelerating clinical development. Various machine learning methods, including classification and regression trees (CART), random forest (RF), Bayesian network (BN), and conditional tabular generative adversarial network (CTGAN), have been employed for this purpose, but how well they reproduce actual patient survival data remains under investigation.
Objective: The aim of this study was to determine the most suitable SPD generation method for oncology trials, focusing on progression-free survival (PFS) and overall survival (OS), which are the primary evaluation endpoints in oncology trials. To achieve this goal, we conducted a comparative simulation of 4 generation methods (CART, RF, BN, and CTGAN) and evaluated the performance of each method. 0.9. Conversely, for the other methods, no consistent trend was observed for either PFS or OS. CART showed better similarity than RF because CART overfits the actual data, whereas RF, an ensemble learning method, prevents overfitting. In SPD generation, the goal is to reproduce the statistical properties of the actual data, not to build a well-generalized prediction model. Neither BN nor CTGAN could accurately reflect the statistical properties of the actual data because these methods are not well suited to small datasets.
Conclusions: For generating SPD for survival data from small datasets, such as clinical trial data, CART proved to be the most effective method compared with RF, BN, and CTGAN. CART-based generation methods could be improved further in future work by incorporating feature engineering and other techniques.
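The sketch below illustrates, under stated assumptions, the general idea behind CART-based synthesis and why a fully grown tree stays close to the actual data. It is not the study's implementation: the single covariate, the squared-error split criterion, the helper names (`sum_sq`, `best_split`, `grow`, `synthesize`), and the toy records are all hypothetical choices made for illustration. A tree is grown on the actual records, and each synthetic (time, event) pair is drawn from the actual records that share a leaf with the target covariate value.

```python
import random

# Toy records (made up for illustration): (age, survival_time_months, event_flag)
actual = [(45, 30.1, 1), (52, 24.0, 0), (58, 18.5, 1),
          (63, 12.2, 1), (67, 9.8, 0), (71, 6.4, 1)]

def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, min_leaf):
    """Threshold on the covariate minimizing squared error of time, or None."""
    rows = sorted(rows)
    best = None
    for i in range(min_leaf, len(rows) - min_leaf + 1):
        if rows[i - 1][0] == rows[i][0]:
            continue  # cannot split between equal covariate values
        left = [r[1] for r in rows[:i]]
        right = [r[1] for r in rows[i:]]
        sse = sum_sq(left) + sum_sq(right)
        if best is None or sse < best[0]:
            best = (sse, (rows[i - 1][0] + rows[i][0]) / 2)
    return None if best is None else best[1]

def grow(rows, min_leaf):
    """Recursively grow a tree; leaves keep their actual records as donors."""
    thr = best_split(rows, min_leaf)
    if thr is None:
        return {"donors": rows}
    left = [r for r in rows if r[0] <= thr]
    right = [r for r in rows if r[0] > thr]
    return {"thr": thr,
            "left": grow(left, min_leaf),
            "right": grow(right, min_leaf)}

def synthesize(tree, x, rng):
    """Draw a synthetic (time, event) pair from the leaf that x falls into."""
    node = tree
    while "donors" not in node:
        node = node["left"] if x <= node["thr"] else node["right"]
    _, time, event = rng.choice(node["donors"])
    return time, event

# Fully grown tree: every leaf holds one record, so draws copy actual values.
deep = grow(actual, min_leaf=1)
# Constrained tree: leaves pool several records, so draws vary within a leaf.
shallow = grow(actual, min_leaf=3)

rng = random.Random(0)
print(synthesize(deep, 58, rng))
print(synthesize(shallow, 58, rng))
```

In this toy setting, the fully grown tree returns the observed outcome of the matching actual record, which mirrors the point that overfitting helps similarity to the actual data. The constrained tree draws from a pooled set of donors instead. In this sketch the event flag is carried along with the time but does not enter the split criterion, which is a simplification.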