ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data

Zhou, Yi‐Hui; Saghapour, Ehsan

doi:10.3389/fgene.2021.691274

Cited by 5 publications

(1 citation statement)

References 30 publications


“…Feature pre-processing included clipping the outlier values to the 5th and 95th percentile values and scaling between [0,1] using the Minmax Scalar package from sklearn (version 1.5.0) [34]. Since each of the selected features are ordinal variables and had a very low missing rate of <5%, we imputed the missing values for each feature column using the median value of that feature across all visits of all patients, following previous work [35] (Fig 1).…”
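The pre-processing steps in the quoted statement (percentile clipping, scaling to [0,1], median imputation) can be sketched as follows. This is a minimal NumPy illustration, not the citing authors' code: the function name `clip_scale_impute`, the toy matrix, and the order of operations (clip, then scale, then impute) are assumptions, and the scaling is written out by hand rather than calling sklearn's `MinMaxScaler`.

```python
import numpy as np


def clip_scale_impute(X, lower_pct=5.0, upper_pct=95.0):
    """Clip each column to its percentile range, scale to [0, 1], median-impute.

    Rows stand for visits (pooled over all patients) and columns for features.
    Hypothetical helper; the step order is an assumption.
    """
    X = np.asarray(X, dtype=float)

    # Per-feature percentile bounds, computed while ignoring missing values.
    lo = np.nanpercentile(X, lower_pct, axis=0)
    hi = np.nanpercentile(X, upper_pct, axis=0)

    # Clip outliers; NaN entries stay NaN through np.clip.
    clipped = np.clip(X, lo, hi)

    # Min-max scaling. After clipping, each column's min and max are lo and hi.
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    scaled = (clipped - lo) / span

    # Replace missing values with the column median over all rows.
    medians = np.nanmedian(scaled, axis=0)
    return np.where(np.isnan(scaled), medians, scaled)


# Toy data: 5 visits, 2 features, one missing value per feature.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [100.0, 40.0],
    [np.nan, 20.0],
])
X_clean = clip_scale_impute(X)
```

Imputing after scaling keeps the filled-in value on the same [0,1] scale as the observed values; imputing before clipping would give the same result for the median here, but that equivalence is not something the quoted passage states.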

Section: Feature Pre-processing (mentioning)

confidence: 99%

Examining heterogeneity in dementia using data-driven unsupervised clustering of cognitive profiles

Kumar, Oh, Schindler et al. 2024

Preprint


Dementia is characterized by a decline in memory and thinking that is significant enough to impair function in activities of daily living. Patients seen in dementia specialty clinics are highly heterogeneous, with a variety of different symptoms that progress at different rates. Recent research has focused on finding data-driven subtypes to reveal new insights into dementia’s underlying heterogeneity, rather than analyzing the entire cohort as a single homogeneous group. However, existing studies on dementia subtyping suffer from the following limitations: (i) focusing on AD-related dementia only and not examining heterogeneity within dementia as a whole, (ii) using only cross-sectional baseline visit information for clustering and (iii) predominantly relying on expensive imaging biomarkers as features for clustering. In this study, we used a data-driven unsupervised clustering algorithm named SillyPutty, in combination with hierarchical clustering on neuropsychological assessment scores, to estimate subtypes within a real-world clinical dementia cohort. We incorporated all longitudinal patient visits into our clustering analysis, instead of relying only on baseline visits, allowing us to explore the ongoing relationship between subtypes and disease progression over time. Results showed evidence of (i) subtypes with very mild or mild dementia being more heterogeneous in their cognitive profiles and risk of disease progression.


The impact of imputation quality on machine learning classifiers for datasets with missing values

Shadbahr, Roberts, Stanczuk et al. 2023

Commun Med


Background: Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier’s performance. Methods: We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data. Results: The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised. Conclusions: It is imperative to consider the quality of the imputation when performing downstream classification, as the effects on the classifier can be considerable.
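For readers unfamiliar with the sliced Wasserstein distance mentioned in this abstract, the sketch below shows a generic Monte Carlo estimate of it between two equally sized samples. It is not the discrepancy score defined by Shadbahr et al.; the function name `sliced_wasserstein`, the use of the order-1 distance, the number of projections, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def sliced_wasserstein(X, Y, n_projections=200, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein-1 distance.

    X and Y are (n_samples, n_features) arrays with identical shapes
    (equal sample sizes are assumed to keep the 1-D step simple).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape != Y.shape:
        raise ValueError("X and Y must have the same shape in this sketch")

    rng = np.random.default_rng(seed)

    # Random unit vectors: one projection direction per row.
    directions = rng.normal(size=(n_projections, X.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Project both samples onto every direction and sort each projection.
    proj_x = np.sort(X @ directions.T, axis=0)
    proj_y = np.sort(Y @ directions.T, axis=0)

    # For equal-size 1-D samples, W1 is the mean absolute difference
    # between the sorted values; average that over all directions.
    per_direction = np.mean(np.abs(proj_x - proj_y), axis=0)
    return float(np.mean(per_direction))


# Toy comparison: a "complete" sample versus a shifted copy that stands in
# for a biased imputation (purely illustrative data).
rng = np.random.default_rng(1)
complete = rng.normal(size=(100, 3))
shifted = complete + 0.5

same = sliced_wasserstein(complete, complete)
different = sliced_wasserstein(complete, shifted)
```

A score of this kind compares whole distributions rather than per-cell errors, which is why it can flag imputations that match individual values on average yet distort the overall data distribution.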


Accelerating Big Data Analysis through LASSO-Random Forest Algorithm in QSAR Studies

et al. 2021


Motivation: The aim of quantitative structure-activity prediction (QSAR) studies is to identify novel drug-like molecules that can be suggested as lead compounds by means of two approaches, which are discussed in this article: first, to identify appropriate molecular descriptors by focusing on one feature-selection algorithm; and second, to predict the biological activities of designed compounds. Recent studies have shown increased interest in the prediction of a huge number of molecules, known as Big Data, using deep learning models. However, despite all these efforts to solve critical challenges in QSAR models, problems such as over-fitting and massive processing procedures remain major shortcomings of deep learning models. Hence, finding the most effective molecular descriptors in the shortest possible time is an ongoing task. One of the successful methods to speed up the extraction of the best features from big datasets is the use of the least absolute shrinkage and selection operator (LASSO). This algorithm is a regression model that selects a subset of molecular descriptors with the aim of enhancing prediction accuracy and interpretability by removing inappropriate and irrelevant features. Results: To implement and test our proposed model, a random forest was built to predict the molecular activities of Kaggle competition compounds. Finally, the prediction results and computation time of the suggested model were compared with other well-known algorithms, i.e. Boruta-random forest, deep random forest, and deep belief network model. The results revealed that improving output correlation through LASSO-random forest leads to appreciably reduced implementation time and model complexity, while maintaining accuracy of the predictions. Supplementary information: Supplementary data are available at Bioinformatics online.
