Data Science and Knowledge Engineering for Sensing Decision Support 2018
DOI: 10.1142/9789813273238_0160

Machine learning using synthetic and real data: Similarity of evaluation metrics for different healthcare datasets and for different algorithms

Abstract: Sharing data is often a risk in terms of security and privacy, especially if the data is sensitive. Algorithms can be used to generate synthetic data from an original raw dataset in order to share data that are considered more 'privacy preserving', and that increase the level of anonymity. In this paper, we carry out an experiment to study the validity of conducting machine learning on synthetic data. We compare the evaluation metrics produced from machine learning models that were trained using synthetic data …

Citations: cited by 24 publications (15 citation statements)
References: 5 publications (6 reference statements)
“…While a number of synthetic data generators have been developed, empirical evidence of their efficacy has not been fully explored. This work extends a preliminary study [18] and investigates whether fully synthetic data can preserve the hidden complex patterns supervised machine learning can uncover from real data and therefore whether it can be used as a valid alternative to real data when developing eHealth apps and health care policy making solutions. This will be achieved by experimenting with a range of open health care datasets.…”
Section: Introduction (mentioning)
confidence: 72%
“…Narrow or specific measures are widely used for assessing synthetic data [15], [19], [20], [27], [31], [32]. They are useful when the analysis to be performed on the synthetic data is known ahead of time.…”
Section: B Utility Metrics: Overview and Classification (mentioning)
confidence: 99%
“…We chose classification as it is a popular tool for synthetic data evaluation. On the other hand, one of the objectives of this investigation is to evaluate whether the other three dimensions of quality are good predictors of application-level fidelity [15], [19], [20], [44].…”
Section: Application Fidelity (mentioning)
confidence: 99%
“…Such an evaluation involves comparing the performance metrics of predictive models trained on synthetic and on real data (called as model compatibility). This performance of a machine learning models trained and tested on real and or synthetic data is compared based on different scenarios [12,14,18]: Train on Real and Test on Synthetic data (TRTS), Train on Synthetic and Test on Real (TSTR), Train on Real, Test on Real (TRTR) and Train on Synthetic, Test on Synthetic (TSTS), and lastly trained and tested on a mixture of real and synthetic data (TMTM). In principle, these scenarios are transferable to the evaluation of synthetic data in recommender systems.…”
Section: Reliable Evaluation (mentioning)
confidence: 99%
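
The train/test scenarios named in the statement above (TRTR, TRTS, TSTR, TSTS) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the cited papers: the toy one-feature datasets, the nearest-centroid classifier, and the helper names `make_data`, `split`, `fit_centroids`, `accuracy` and `evaluate_scenarios` are all hypothetical. The "synthetic" set here is just a second random sample with a slightly shifted class mean, standing in for the output of a real synthetic data generator. The mixed scenario (TMTM) is omitted for brevity.

```python
import random


def make_data(n, shift, seed):
    """Toy two-class data: one feature; class 1 is centred at `shift`, class 0 at 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(shift * label, 1.0), label))
    return rows


def split(rows, frac=0.7):
    """Split rows into a training part and a held-out test part."""
    cut = int(len(rows) * frac)
    return rows[:cut], rows[cut:]


def fit_centroids(train):
    """Nearest-centroid 'model': the mean feature value of each class."""
    means = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        means[label] = sum(values) / len(values)
    return means


def accuracy(model, test):
    """Fraction of test rows whose nearest class centroid matches the label."""
    hits = 0
    for x, y in test:
        predicted = min(model, key=lambda c: abs(x - model[c]))
        hits += predicted == y
    return hits / len(test)


def evaluate_scenarios(real, synthetic):
    """Return accuracy for the four train/test combinations."""
    real_train, real_test = split(real)
    syn_train, syn_test = split(synthetic)
    model_real = fit_centroids(real_train)
    model_syn = fit_centroids(syn_train)
    return {
        "TRTR": accuracy(model_real, real_test),  # train real, test real
        "TRTS": accuracy(model_real, syn_test),   # train real, test synthetic
        "TSTR": accuracy(model_syn, real_test),   # train synthetic, test real
        "TSTS": accuracy(model_syn, syn_test),    # train synthetic, test synthetic
    }


# Hypothetical data: the second sample stands in for a generator's output.
real = make_data(400, shift=3.0, seed=0)
synthetic = make_data(400, shift=2.8, seed=1)

scores = evaluate_scenarios(real, synthetic)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

In a comparison of this kind, TRTR usually serves as the baseline, and a TSTR score close to it is read as a sign that models trained on the synthetic data transfer to real data.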