Approach and Method for Generating Realistic Synthetic Electronic Healthcare Records for Secondary Use

Dube, Kudakwashe; Gallagher, Thomas

doi:10.1007/978-3-642-53956-5_6

Cited by 21 publications

(16 citation statements)

References 17 publications


“…Most of the SDC/SDL literature focuses on survey data from the social sciences and demography. The generation of synthetic electronic health records has been addressed in Dube and Gallagher [8].…”

Section: Related Work (mentioning)

confidence: 99%

“…Given the risks of re-identification of patient data and the delays inherent in making such data more widely available, synthetically generated data is a promising alternative or addition to standard anonymization procedures. Synthetic data generation has been researched for nearly three decades [3] and applied across a variety of domains [4,5], including patient data [6] and electronic health records (EHR) [7,8]. It can be a valuable tool when real data is expensive, scarce or simply unavailable.…”

Section: Introduction (mentioning)

confidence: 99%

“…A number of synthetic patient data generation methods aim to minimize the use of actual patient data by combining simulation, public population-level statistics, and domain expert knowledge bases [7][8][9][10]. For example, in Dube and Gallagher [8] synthetic electronic health records are generated by leveraging publicly available health statistics, clinical practice guidelines, and medical coding and terminology standards. In a related approach, patient demographics (obtained from actual patient data) are combined with expert-curated, publicly available patient care patterns to generate synthetic electronic medical records [9].…”

Section: Introduction (mentioning)

confidence: 99%


Generation and evaluation of synthetic patient data

Gonçalves, Ray, Soper, et al. 2020

BMC Med Res Methodol


Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges. Methods: In this paper, we evaluate three classes of synthetic data generation approaches: probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed. Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance Epidemiology and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases. Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.


“…From these statistics, we can get a fair picture regarding the demographics, and prevalence of symptoms and comorbidities in the infected population. The synthetic data was generated by GRiSER’s method [21]. Fig. 1 explains the approach to build our dataset from open source information and clinical knowledge.…”

Section: Proposed Methods (mentioning)

confidence: 99%

Using Machine Learning to assess Covid-19 risks

Muthya, Nair, Arokiaswamy, et al. 2020

Preprint


IMPORTANCE: Identifying potential Covid-19 patients in the general population is a huge challenge at the moment. Given the low availability of clinical data from infected Covid-19 patients, it is challenging to understand the similar and complex patterns in these symptomatic patients. Laboratory testing for Covid-19 antigen with RT-PCR (reverse transcriptase polymerase chain reaction) is not possible or economical for whole populations. OBJECTIVE: To develop a Covid risk stratifier model that classifies people into different risk cohorts based on their symptoms, and to validate it. DESIGN: Covid cases across Wuhan and New York were analysed to identify the course of these cases before they became symptomatic and were hospitalised for the infection. A dataset based on these statistics was generated and then fed into an unsupervised learning algorithm to reveal patterns and identify similar groups of people in the population. Each of these cohorts was then classified into one of three risk levels, which were validated against real-world cases and studies. SETTING: The study is based on the general population. PARTICIPANTS: The adult population was considered for the analysis, development and validation of the model. RESULTS: Of 1 million observations generated, 20% exhibited Covid symptoms and patterns, and 80% belonged to the asymptomatic and non-infected group of people. Upon clustering, three clinically obvious clusters were obtained: Cluster A had the 20% of symptomatic cases, which were classified into one cohort; Cluster B had people with no symptoms but a high number of comorbidities; and Cluster C had people with a few leading indicators for the infection and few comorbidities. This was then validated against 300 participants whose data we collected as part of a research study through our Covid-research tool, and about 92% of them were classified correctly.
CONCLUSION: A model was developed and validated that classifies people into Covid risk categories based on their symptoms. It can be used to monitor and track cases that rapidly transition into being symptomatic and eventually test positive for the infection, in order to initiate early medical interventions. KEYWORDS: Covid-19, Synthetic Data, Patient Clustering, Unsupervised Learning, Risk Classification


“…To add missing features, modelling-based approaches have to be integrated into a data-driven generator [7]. They also require access to a background EHR corpus, which is subject to privacy laws and may also lead to inadvertent disclosure of protected health information from the real patient data [5,6,10].…”

Section: Prior Work (mentioning)

confidence: 99%

Desiderata for a Synthetic Clinical Data Generator

Wiedekopf, Ulrich, Essenwanger, et al. 2021

Studies in Health Technology and Informatics


The current movement in Medical Informatics towards comprehensive Electronic Health Records (EHRs) has enabled a wide range of secondary use cases for this data. However, due to a number of well-justified concerns and barriers, especially with regards to information privacy, access to real medical records by researchers is often not possible, and indeed not always required. An appealing alternative to the use of real patient data is the employment of a generator for realistic, yet synthetic, EHRs. However, we have identified a number of shortcomings in prior works, especially with regards to the adaptability of the projects to the requirements of the German healthcare system. Based on three case studies, we define a non-exhaustive list of requirements for an ideal generator project that can be used in a wide range of localities and settings, to address and enable future work in this regard.
