2020
DOI: 10.2196/18910

Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing

Abstract: Background: The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. Objective: This work sets out to understand the difference in performance of supervised machine learning models trained on synthet…

Cited by 111 publications (96 citation statements)
References 40 publications
“…Moreover, the combination of the MIDAS developed GYDRA data preparation tool, alongside synthetic dataset generation strategies, can enable hospitals and healthcare providers, to: 1) refine and prepare their datasets (with the required metadata description), and; 2) share synthetically generated privacy-preserving datasets with the scientific community, that follow statistical patterns similar to the real data, and have proven to be reliable for training machine learning models [28]. These mechanisms would enable users to load a controlled dataset into the MIDAS platform and to develop in-house analytics, whilst simultaneously allowing the scientific community to develop AI models based on synthetic datasets that can later be fed back to the policy-makers through the MIDAS platform.…”
Section: Ingesting Useful Open Data Sources (mentioning)
confidence: 99%
“…Such a knowledge-based model depends on prior knowledge of the system, and how much we can intellect about it (Kim et al, 2017; Bonnéry et al, 2019). On one hand, theory-based modelling aims at understanding and offers interpretability, on the other when modelling complex systems, simplifications and assumptions are inevitable, leading to inaccuracies or reduced utility (Hand, 2019; Rankin et al, 2020). In fact, relying on population-level statistics does not produce models capable of reproducing heterogeneous health outcomes (Chen et al, 2019a).…”
Section: Synthetic Data (mentioning)
confidence: 99%
“…ehrGAN is developed for sequences of medical codes Che et al. It learns a transitional distribution, combining an Encoder-Decoder CNN (Rankin et al, 2020) with VCD . The ehrGAN generator is trained to decode a random vector mixed with the latent space representation of a real patient (See Panel 2).…”
Section: Semi-supervised Learning (mentioning)
confidence: 99%
“…They bound their representational power to correlations intelligible to the modeler, being prone to obscure inaccuracies. SD generated by these models hits a ceiling of utility (Rankin et al, 2020). In the ML field, generative models learn an approximation of the multi-modal distribution, from which we can draw synthetic samples (Goodfellow et al, 2014).…”
Section: Synthetic Data (mentioning)
confidence: 99%
“…Having served its primary purpose, this wealth of detailed information can further benefit patient well-being by sustaining medical research and development. That is to say, improving the development life-cycle of Health Informatics (HI), the predictive accuracy of Machine Learning (ML) algorithms, or enabling discoveries in research on clinical decisions, triage decisions, inter-institution collaboration, and HI automation (Rudin et al, 2020; Rankin et al, 2020). Big health data is the underpinning of two prime objectives of precision medicine: individualization of patient interventions and inferring the workings of biological systems from high-level analysis (Capobianco, 2020).…”
Section: Introduction (mentioning)
confidence: 99%