Background
Handling missing data is an indispensable step in processing real-world health care data. However, both deleting and imputing missing data may introduce error or lead to multicollinearity. We therefore explored a novel missing-data processing method that avoids these issues.
Method
We developed a missing-data processing method that retains as much information from the original dataset as possible by searching for an optimal way to delete columns and rows that contain missing data. In principle, this optimum can be found by traversing all possible deletion combinations, but the computational cost is prohibitive for large datasets. We therefore established the Optimal Intact Subset Method (OIS.Method), which uses an indicator combining the missing-data information of both columns and rows to determine an optimal order in which to delete columns. OIS.Method thereby identifies the optimal deletion scheme while reducing computation. To validate its effectiveness, we compared OIS.Method with five other data-imputation methods on 700 computer-generated classification datasets (simulated datasets 1). To better resemble real-world data, we generated simulated datasets 2 by introducing redundant variables into simulated datasets 1, and compared OIS.Method with the control methods on these as well. Finally, we validated OIS.Method on two real-world classification tasks: (1) predicting the risk of hypotension during dialysis, and (2) predicting the risk of adverse drug reactions in elderly patients with type 2 diabetes.
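The idea of ordering column deletions by an indicator, instead of enumerating every combination, can be illustrated with a minimal sketch. The abstract does not define the indicator, so the score used here is a hypothetical stand-in: each missing cell contributes 1 / (number of missing cells in its row) to its column. The function name `intact_subset_sketch` and the "retained cells" objective (rows × columns of the complete subset) are likewise assumptions made for illustration, not the published OIS.Method.

```python
from typing import List, Optional, Sequence, Tuple


def intact_subset_sketch(
    data: Sequence[Sequence[Optional[float]]],
) -> Tuple[List[int], List[int]]:
    """Return (kept_row_indices, kept_column_indices) of a complete subset.

    Missing cells are represented by None. The indicator below is a
    hypothetical stand-in, not the indicator defined by the authors.
    """
    n_cols = len(data[0]) if data else 0

    # Row-level information: number of missing cells in each row.
    row_missing = [sum(v is None for v in row) for row in data]

    # Assumed indicator: a missing cell weighs more when it is the only
    # reason its row is incomplete, so both row and column information count.
    score = [0.0] * n_cols
    for row, n_missing in zip(data, row_missing):
        for j, value in enumerate(row):
            if value is None:
                score[j] += 1.0 / n_missing

    # Candidate deletion order: columns with missing data, highest score first.
    order = sorted((j for j in range(n_cols) if score[j] > 0),
                   key=lambda j: -score[j])

    best_cells, best_rows, best_cols = -1, [], []
    # Only len(order) + 1 candidates are evaluated, not every combination.
    for k in range(len(order) + 1):
        dropped = set(order[:k])
        cols = [j for j in range(n_cols) if j not in dropped]
        # After deleting k columns, delete every row that is still incomplete.
        rows = [i for i, row in enumerate(data)
                if all(row[j] is not None for j in cols)]
        cells = len(rows) * len(cols)
        if cells > best_cells:
            best_cells, best_rows, best_cols = cells, rows, cols
    return best_rows, best_cols


# Toy example: the last column is missing in most rows.
example = [
    [1.0, 2.0, 3.0, None],
    [4.0, 5.0, 6.0, None],
    [7.0, 8.0, 9.0, None],
    [1.0, None, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
]
kept_rows, kept_cols = intact_subset_sketch(example)
print(kept_rows, kept_cols)
```

In this toy example, deleting only the sparsest column and then the one remaining incomplete row retains 12 cells, compared with 4 cells when incomplete rows are deleted directly and 10 cells when both incomplete columns are deleted.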
Results
In simulated datasets 1, OIS.Method performed well when missing data were unevenly distributed among columns. In simulated datasets 2, the overall performance of OIS.Method was better across all evaluation dimensions. In the two real-world datasets, OIS.Method achieved better classification performance as measured by the area under the ROC curve (AUC). For OIS.Method vs. Simple Impute vs. Random Forest vs. Modified Random Forest, the AUCs were 0.8179 ± 0.0005 vs. 0.8116 ± 0.0002 vs. 0.8087 ± 0.0009 vs. 0.8093 ± 0.0014 in task 1, and 0.7028 vs. 0.6963 vs. 0.6957 vs. 0.6699 in task 2.
Conclusions
Our study provides a novel method for handling missing data in real-world studies. Compared with existing missing-data processing methods, OIS.Method requires less computation and can reflect the true state of the original data. Moreover, OIS.Method is well suited to real-world datasets with large sample sizes and many variables.