Gorka Epelde scite author profile

Background The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. Objective This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. Methods A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. Results A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. Conclusions The results of this study are promising with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.

show abstract

Natural Interaction between Avatars and Persons with Alzheimer’s Disease

Carrasco

Epelde

Moreno

et al. 2008

View full text Add to dashboard Cite

Upper Limb Posture Estimation in Robotic and Virtual Reality-Based Rehabilitation

Cortés

Ardanza

Molina‐Rueda

et al. 2014

BioMed Research International

View full text Add to dashboard Cite

New motor rehabilitation therapies include virtual reality (VR) and robotic technologies. In limb rehabilitation, limb posture is required to (1) provide a limb realistic representation in VR games and (2) assess the patient improvement. When exoskeleton devices are used in the therapy, the measurements of their joint angles cannot be directly used to represent the posture of the patient limb, since the human and exoskeleton kinematic models differ. In response to this shortcoming, we propose a method to estimate the posture of the human limb attached to the exoskeleton. We use the exoskeleton joint angles measurements and the constraints of the exoskeleton on the limb to estimate the human limb joints angles. This paper presents (a) the mathematical formulation and solution to the problem, (b) the implementation of the proposed solution on a commercial exoskeleton system for the upper limb rehabilitation, (c) its integration into a rehabilitation VR game platform, and (d) the quantitative assessment of the method during elbow and wrist analytic training. Results show that this method properly estimates the limb posture to (i) animate avatars that represent the patient in VR games and (ii) obtain kinematic data for the patient assessment during elbow and wrist analytic rehabilitation.

show abstract

Synthetic data generation for tabular health records: A systematic review

et al. 2022

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Gorka Epelde

Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing

Natural Interaction between Avatars and Persons with Alzheimer’s Disease

Upper Limb Posture Estimation in Robotic and Virtual Reality-Based Rehabilitation

Synthetic data generation for tabular health records: A systematic review

Contact Info

Product

Resources

About