Handling missing data is indispensable in health-care real-world data processing. Imputation methods may introduce error and multicollinearity. We therefore developed the Optimal Intact Subset Method (OIS.Method) to avoid these issues. By searching for the optimal way to delete columns and rows containing missing data, the method identifies a subset that retains most of the information in the original dataset. A brute-force approach would traverse all possible deletion combinations, but its computational cost is too high for large datasets. OIS.Method instead uses an indicator to determine the optimal deletion order, which identifies the optimal deletion scheme while simplifying computation. To validate the effectiveness of OIS.Method, we compared it with five other missing-data handling methods on simulated real-world classification datasets. Additionally, we validated OIS.Method on two real-world classification tasks. On the simulated datasets, OIS.Method performed best (highest AUC: 1). On the real-world datasets, OIS.Method achieved better classification performance. For example, the AUCs for OIS.Method, Simple Impute, Random Forest, and Modified Random Forest were 0.8179±0.0005, 0.8116±0.0002, 0.8087±0.0009, and 0.8093±0.0014 in task 1, and 0.7028±0.0126, 0.6963±0.0231, 0.6957±0.0247, and 0.6699±0.0249 in task 2. OIS.Method requires less computation and is well suited to large real-world datasets.
Background: Handling missing data is indispensable in health-care real-world data processing. Deleting or imputing missing data may introduce error or lead to multicollinearity. We therefore set out to develop a novel missing-data processing method that avoids these issues.

Method: By searching for the optimal way to delete columns and rows containing missing data, we developed a missing-data processing method that retains most of the information in the original dataset. In principle, this goal can be reached by traversing all possible deletion combinations, but the computational cost is too high for large datasets. We therefore established the Optimal Intact Subset Method (OIS.Method), which uses an indicator containing the missing-data information of both columns and rows to determine an optimal deletion order of columns. OIS.Method can thus identify the optimal deletion scheme while simplifying computation. To validate the effectiveness of OIS.Method, we compared it with five other data-imputation methods on 700 computer-generated classification datasets (simulated datasets 1). To better mimic real-world datasets, we generated simulated datasets 2 by introducing redundant variables into simulated datasets 1, and again compared OIS.Method with the control methods. Finally, we validated OIS.Method on two real-world classification tasks: (1) predicting the risk of hypotension during dialysis, and (2) predicting the risk of adverse drug reactions in elderly patients with type 2 diabetes.

Results: In simulated datasets 1, OIS.Method performed well when the distribution of missing data was unbalanced among columns. In simulated datasets 2, the overall performance of OIS.Method was better in all evaluation dimensions. In the two real-world datasets, OIS.Method achieved better classification performance. Measured by the area under the ROC curve (AUC), the results for OIS.Method, Simple Impute, Random Forest, and Modified Random Forest were 0.8179 ± 0.0005, 0.8116 ± 0.0002, 0.8087 ± 0.0009, and 0.8093 ± 0.0014 in task 1, and 0.7028, 0.6963, 0.6957, and 0.6699 in task 2.

Conclusions: Our study provides a novel method for handling missing data in real-world studies. Compared with existing missing-data processing methods, OIS.Method requires less computation and can reflect the true data situation of the original datasets. Moreover, OIS.Method is well suited to real-world datasets with large sample sizes and many variables.
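The idea of replacing an exhaustive search with an ordered deletion of columns can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the indicator or the information measure, so the sketch assumes a stand-in indicator (the number of missing values per column) and a stand-in measure of retained information (the number of cells in the resulting intact subset). The function names `column_missing_counts`, `intact_subset`, and `optimal_intact_subset` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]  # None marks a missing value


def column_missing_counts(rows: Sequence[Row]) -> List[int]:
    """Count missing values in each column (assumed stand-in indicator)."""
    n_cols = len(rows[0])
    return [sum(1 for r in rows if r[j] is None) for j in range(n_cols)]


def intact_subset(rows: Sequence[Row], kept_cols: List[int]) -> List[List[float]]:
    """Keep only kept_cols, then drop every row that still has a missing value."""
    return [
        [r[j] for j in kept_cols]
        for r in rows
        if all(r[j] is not None for j in kept_cols)
    ]


def optimal_intact_subset(rows: Sequence[Row]) -> Tuple[List[int], List[List[float]]]:
    """Try deleting 0, 1, ..., p columns in indicator order and keep the best.

    With p incomplete columns this evaluates p + 1 candidates instead of
    the 2**p column subsets an exhaustive traversal would need.
    """
    n_cols = len(rows[0])
    counts = column_missing_counts(rows)
    # Assumption: columns with more missing values are deleted first.
    order = sorted(
        (j for j in range(n_cols) if counts[j] > 0),
        key=lambda j: counts[j],
        reverse=True,
    )
    best_cols: List[int] = []
    best_subset: List[List[float]] = []
    best_cells = -1
    for k in range(len(order) + 1):
        dropped = set(order[:k])
        kept = [j for j in range(n_cols) if j not in dropped]
        subset = intact_subset(rows, kept)
        # Assumption: retained information = number of intact cells.
        cells = len(subset) * len(kept)
        if cells > best_cells:
            best_cells, best_cols, best_subset = cells, kept, subset
    return best_cols, best_subset


# Toy data: column 2 is almost entirely missing, column 1 has one gap.
rows = [
    [1.0, 2.0, None, 4.0],
    [1.5, None, None, 3.0],
    [2.0, 1.0, None, 5.0],
    [2.5, 3.0, 0.5, 6.0],
    [3.0, 2.5, None, 1.0],
]
kept_cols, subset = optimal_intact_subset(rows)
print(kept_cols, len(subset))
```

On this toy data the sketch drops the mostly empty column and then the single row that is still incomplete, rather than discarding four of the five rows to keep every column. A different indicator or information measure, such as the one the paper actually defines, could change which subset is selected.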