Can synthetic data accurately mimic oncology clinical trials?

Kababji, Samer El; Mitsakakis, Nicholas; Fang, Xi; Beltran-Bless, Ana-Alicia; Pond, Gregory R.; Vandermeer, Lisa; Radhakrishnan, Dhenuka; Mosquera, Lucy; Clemons, Mark; Emam, Khaled El

doi:10.1200/jco.2023.41.16_suppl.1554

Cited by 2 publications

(1 citation statement)

References 0 publications

“…To address these challenges, the exploration of methods for generating synthetic data is underway. [13][14][15][16] The Bayesian network (BN) is a graphical structure that represents the conditional probability of nodes, where the nodes represent continuous or discrete nodes. 17 Studies have demonstrated that a BN can effectively capture variable correlations and generate synthetic data resembling the original data set.…”

Section: Introduction (mentioning)

confidence: 99%

Synthetic Data Improve Survival Status Prediction Models in Early-Onset Colorectal Cancer

Kim, Jang, Sim et al. 2024

JCO Clin Cancer Inform

PURPOSE In artificial intelligence–based modeling, working with a limited number of patient groups is challenging. This retrospective study aimed to evaluate whether applying synthetic data generation methods to the clinical data of small patient groups can enhance the performance of prediction models.

MATERIALS AND METHODS A data set collected by the Cancer Registry Library Project from the Yonsei Cancer Center (YCC), Severance Hospital, between January 2008 and October 2020 was reviewed. Patients with colorectal cancer younger than 50 years who started their initial treatment at YCC were included. A Bayesian network–based synthesizing model was used to generate a synthetic data set, combined with the differential privacy (DP) method.

RESULTS A synthetic population of 5,005 was generated from a data set of 1,253 patients with 93 clinical features. The Hellinger distance and correlation difference metric were below 0.3 and 0.5, respectively, indicating no statistical difference. The overall survival by disease stage did not differ between the synthetic and original populations. Training with the synthetic data and validating with the original data showed the highest performances of 0.850, 0.836, and 0.790 for the Decision Tree, Random Forest, and XGBoost models, respectively. Comparison of synthetic data sets with different epsilon parameters from the original data sets showed improved performance >0.1%. For extremely small data sets, models using synthetic data outperformed those using only original data sets. The reidentification risk measures demonstrated that the epsilons between 0.1 and 100 fell below the baseline, indicating a preserved privacy state.

CONCLUSION The synthetic data generation approach enhances predictive modeling performance by maintaining statistical and clinical integrity, and simultaneously reduces privacy risks through the application of DP techniques.
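The abstract above reports a Hellinger distance below 0.3 between the original and synthetic data. As a rough illustration of what that metric measures, here is a minimal sketch for a single categorical feature. It is not the study's implementation: the function name and the sample values are hypothetical, and the study's exact per-feature aggregation is not described in the abstract.

```python
import math
from collections import Counter


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between the empirical distributions of two
    categorical samples. Returns a value in [0, 1]: 0 means identical
    distributions, 1 means no shared categories."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    n_real = len(real_values)
    n_synth = len(synthetic_values)

    total = 0.0
    for category in categories:
        p = real_counts[category] / n_real      # share in the original data
        q = synth_counts[category] / n_synth    # share in the synthetic data
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total / 2)


# Hypothetical disease-stage labels, for illustration only.
original_stage = ["I", "II", "II", "III", "IV"]
synthetic_stage = ["I", "II", "III", "III", "IV"]
distance = hellinger_distance(original_stage, synthetic_stage)
```

A continuous feature would first need to be binned before this function applies; the choice of bins is an additional assumption that affects the result.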

Validation Assessment of Privacy‐Preserving Synthetic Electronic Health Record Data: Comparison of Original Versus Synthetic Data on Real‐World COVID‐19 Vaccine Effectiveness

Wang, Mott, Zhang et al. 2024

Pharmacoepidemiology and Drug

Purpose: To assess the validity of privacy‐preserving synthetic data by comparing results from synthetic versus original EHR data analysis.

Methods: A published retrospective cohort study on real‐world effectiveness of COVID‐19 vaccines by Maccabi Healthcare Services in Israel was replicated using synthetic data generated from the same source, and the results were compared between synthetic versus original datasets. The endpoints included COVID‐19 infection, symptomatic COVID‐19 infection and hospitalization due to infection and were also assessed in several demographic and clinical subgroups. In comparing synthetic versus original data estimates, several metrics were utilized: standardized mean differences (SMD), decision agreement, estimate agreement, confidence interval overlap, and Wald test. Synthetic data were generated five times to assess the stability of results.

Results: The distribution of demographic and clinical characteristics demonstrated very small difference (< 0.01 SMD). In the comparison of vaccine effectiveness assessed in relative risk reduction between synthetic versus original data, there was a 100% decision agreement, 100% estimate agreement, and a high level of confidence interval overlap (88.7%–99.7%) in all five replicates across all subgroups. Similar findings were achieved in the assessment of vaccine effectiveness against symptomatic COVID‐19 infection. In the comparison of hazard ratios for COVID‐19‐related hospitalization and odds ratio for symptomatic COVID‐19 infection, the Wald tests suggested no significant difference between respective effect estimates in all five replicates for all patient subgroups, but there were disagreements in estimate and decision metrics in some subgroups and replicates.

Conclusions: Overall, comparison of synthetic versus original real‐world data demonstrated good validity and reliability. Transparency on the process to generate high fidelity synthetic data and assurances of patient privacy are warranted.
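Two of the comparison metrics named in this abstract, the standardized mean difference and confidence interval overlap, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's code: the abstract does not give its formulas, so the pooled-variance SMD and the averaged-proportion form of interval overlap used here are common definitions chosen for illustration, and all function names and sample values are hypothetical.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(original, synthetic):
    """SMD for a continuous variable: difference in means divided by the
    pooled standard deviation (assumed definition). Requires at least two
    values per sample and non-zero pooled variance."""
    pooled_sd = math.sqrt((variance(original) + variance(synthetic)) / 2)
    return (mean(original) - mean(synthetic)) / pooled_sd


def ci_overlap(ci_original, ci_synthetic):
    """Confidence interval overlap: the shared length as a fraction of each
    interval's length, averaged over the two intervals (assumed definition).
    Equals 1.0 for identical intervals; negative when they do not intersect."""
    low_o, high_o = ci_original
    low_s, high_s = ci_synthetic
    shared = min(high_o, high_s) - max(low_o, low_s)
    return 0.5 * (shared / (high_o - low_o) + shared / (high_s - low_s))


# Hypothetical values, for illustration only.
smd = standardized_mean_difference([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
overlap = ci_overlap((0.0, 2.0), (1.0, 3.0))
```

Under these definitions, an absolute SMD near zero indicates closely matched distributions, and an overlap near 1.0 indicates that the synthetic and original interval estimates nearly coincide.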
