Hypertension and diabetes have become a global health and economic issue and are among the major chronic conditions worldwide, particularly in developed countries. Addressing this problem requires better knowledge of these diseases in order to characterize chronic patients. Our aim is twofold: (1) to provide an efficient visual tool for identifying clinical patterns in high-dimensional data; and (2) to characterize patient health status through a data-driven approach using electronic health records of healthy, hypertensive and diabetic populations. We propose a two-stage methodology that uses the diagnosis and drug codes of healthy and chronic patients associated with the University Hospital of Fuenlabrada in Spain. The first stage applies the Self-Organizing Map to these data to obtain a set of prototype patients, which are projected onto a grid of nodes. Each node is associated with a prototype patient that captures relationships among clinical characteristics. In the second stage, clustering methods are applied to the prototype patients to find groups of patients with a similar health status. Clusters with distinctive patterns linked to chronic conditions were found. The most remarkable findings were a cluster of pregnant women within the hypertensive population and two clusters of diabetic individuals with significant differences in drug therapy (insulin-dependent and non-insulin-dependent). The proposed methodology proved effective for exploring relationships within clinical data and for finding patterns related to diabetes and hypertension in a visual way. It emerges as a suitable alternative for building appropriate clinical groups and, owing to its data-driven philosophy, is a promising approach that can be applied to any population. A thorough analysis of these groups could yield new and fruitful findings.
INDEX TERMS Electronic health records, machine learning, self-organizing maps, clustering, data visualization, chronic conditions.
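The two-stage methodology described in this abstract can be illustrated with a minimal sketch, written in pure Python under several assumptions that are not taken from the paper: the four binary features, the three toy patient profiles, the 4 x 4 grid, the linear decay schedules, and the use of k-means as the second-stage clustering method are all hypothetical choices made for illustration. It is not the authors' implementation.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_som(data, rows, cols, epochs=30, lr0=0.5, seed=0):
    """Stage 1: fit a rows x cols SOM and return {(row, col): prototype}."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Start every node from a randomly chosen patient vector (copied).
    protos = {node: list(rng.choice(data)) for node in grid}
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
            # Best-matching unit: the node whose prototype is closest to x.
            bmu = min(grid, key=lambda n: sq_dist(protos[n], x))
            for node in grid:
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                w = protos[node]
                for i, xi in enumerate(x):
                    w[i] += lr * h * (xi - w[i])
            step += 1
    return protos


def kmeans(points, k, iters=50, seed=0):
    """Stage 2: plain k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical binary codes per patient:
# [hypertension dx, antihypertensive drug, diabetes dx, insulin]
profiles = {
    "healthy": [0, 0, 0, 0],
    "hypertensive": [1, 1, 0, 0],
    "diabetic": [0, 0, 1, 1],
}
noise = random.Random(42)
patients = []
for base in profiles.values():
    for _ in range(30):
        # Flip each code with 5% probability to mimic noisy records.
        patients.append([float(b ^ (noise.random() < 0.05)) for b in base])

prototypes = train_som(patients, rows=4, cols=4)
nodes = sorted(prototypes)
labels = kmeans([prototypes[n] for n in nodes], k=3)

# Print the grid of cluster labels, one row of nodes per line.
for r in range(4):
    print(" ".join(str(labels[nodes.index((r, c))]) for c in range(4)))
```

Clustering the node prototypes rather than the raw patients keeps the second stage small (16 vectors here) and lets the resulting groups be drawn directly on the map grid, which is the visual aspect the abstract emphasizes.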
Machine Learning (ML) methods have become important for enhancing the performance of decision-support predictive models. However, class imbalance is one of the main challenges in developing ML models, because it limits their generalization and biases the learning algorithms. In this paper, we consider oversampling methods for generating synthetic categorical clinical data, aiming to improve the predictive performance of ML models and the identification of risk factors for cardiovascular diseases (CVDs). We performed a comparative study of several categorical synthetic data generation methods, including Generative Adversarial Networks (GANs). Then, we assessed the impact of combining oversampling strategies with linear and nonlinear supervised ML methods. Lastly, we conducted a post-hoc model interpretability analysis based on the importance of the risk factors. Experimental results show the potential of GAN-based models for generating high-quality categorical synthetic data, yielding probability mass functions that are very close to those of the real data, maintaining relevant insights, and contributing to increased predictive performance. The GAN-based model combined with a linear classifier outperforms the other oversampling techniques, improving the area under the curve by 2%. These results demonstrate the capability of synthetic data to help both in determining risk factors and in building models for CVD prediction.
Machine Learning (ML) methods have become important for enhancing the performance of decision-support predictive models. However, class imbalance is one of the main challenges in developing ML models, because it may bias the learning process and impair the model's ability to generalize. In this paper, we consider oversampling methods for generating synthetic categorical clinical data, aiming to improve the predictive performance of ML models and the identification of risk factors for cardiovascular diseases (CVDs). We performed a comparative study of several categorical synthetic data generation methods, including Synthetic Minority Oversampling Technique Nominal (SMOTEN), Tabular Variational Autoencoder (TVAE) and Conditional Tabular Generative Adversarial Networks (CTGANs). Then, we assessed the impact of combining oversampling strategies with linear and nonlinear supervised ML methods. Lastly, we conducted a post-hoc model interpretability analysis based on the importance of the risk factors. Experimental results show the potential of GAN-based models for generating high-quality categorical synthetic data, yielding probability mass functions that are very close to those of the real data, maintaining relevant insights, and contributing to increased predictive performance. The GAN-based model combined with a linear classifier outperforms the other oversampling techniques, improving the area under the curve by 2%. These results demonstrate the capability of synthetic data to help with both determining risk factors and building models for CVD prediction.
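The claim that synthetic data yield probability mass functions close to those of the real data can be made concrete with a small sketch. It rests on assumptions that are not from the paper: the feature names and records are hypothetical, naive random oversampling (resampling rows with replacement) stands in for SMOTEN, TVAE and CTGAN, and total variation distance is just one possible way to compare two probability mass functions.

```python
import random
from collections import Counter


def pmf(values):
    """Empirical probability mass function of a list of categories."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two PMFs (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def random_oversample(rows, n_new, seed=0):
    """Naive stand-in for a generator: draw rows with replacement."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(n_new)]


# Hypothetical minority-class records (not real clinical data).
features = ["smoker", "physical_activity", "bmi_category"]
minority = [
    ("yes", "low", "obese"),
    ("yes", "low", "overweight"),
    ("no", "low", "obese"),
    ("yes", "medium", "obese"),
    ("no", "medium", "overweight"),
    ("yes", "low", "obese"),
    ("no", "high", "normal"),
    ("yes", "medium", "overweight"),
]

synthetic = random_oversample(minority, n_new=200)

# Compare real and synthetic marginal PMFs feature by feature.
tv_per_feature = {
    name: total_variation(
        pmf([row[i] for row in minority]),
        pmf([row[i] for row in synthetic]),
    )
    for i, name in enumerate(features)
}

for name, dist in tv_per_feature.items():
    print(f"{name}: total variation distance = {dist:.3f}")
```

A distance near zero for every feature means the synthetic marginals track the real ones. This check only covers per-feature marginals, so it says nothing about whether relationships between features are preserved.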