In the context of a disease prediction model, a false negative error occurs when a patient is wrongly predicted to be free of the disease. Developing a prediction model involves data collection and feature selection, which extracts the relevant features from the dataset. Two commonly employed feature selection approaches are domain knowledge and data-driven selection; applied alone, each suffers from bias towards past or current knowledge. In this research, we studied the development of a measles prediction model that incorporates both the domain knowledge and the data-driven approaches, the latter using the Random Forest classifier. The domain expert first set the important features based on his prior knowledge of measles, in order to reduce the number of features. These attributes then became the input to the Random Forest classifier, and the least important attributes were excluded using the Mean Decrease Gini in order to test the effect on the result. It is found that removing several attributes after domain knowledge consultation can provide a good model with fewer false negative errors.
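The quantity behind the Mean Decrease Gini ranking can be illustrated with a minimal sketch. It computes the Gini impurity reduction for a single threshold split and then drops the lowest-scoring feature. This is not the authors' implementation: a Random Forest averages such reductions over many splits and many trees, and the feature names, data values, and helper functions below are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(values, labels, threshold):
    """Impurity reduction obtained by splitting the samples on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


def drop_least_important(importances, n_drop):
    """Keep every feature except the n_drop features with the lowest score."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:len(ranked) - n_drop]


# Hypothetical toy data: 1 = measles positive, 0 = measles negative.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "feature_a": [1, 1, 1, 0, 0, 0],  # separates the classes perfectly
    "feature_b": [1, 0, 1, 0, 1, 0],  # only weakly related to the label
}

# Single-split importance per feature (a forest would average many of these).
importances = {
    name: gini_decrease(values, labels, 0.5)
    for name, values in features.items()
}
kept = drop_least_important(importances, n_drop=1)
```

Here `feature_a` yields pure child nodes and therefore the larger decrease, so it survives the pruning step while `feature_b` is excluded.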
Measles is an emerging infectious disease with an increasing number of reported cases. It is a vaccine-preventable disease; thus, it is common to have an imbalanced class problem in the dataset. This study aims to resolve the imbalanced class problem for the prediction of measles infection risk and to compare the predictive results on a balanced dataset across three machine learning techniques. The data used in this study contained 37,884 records of suspected measles cases that were highly imbalanced towards negative measles cases. The Synthetic Minority Over-Sampling Technique (SMOTE) was performed to balance the distribution of the target attribute. The balanced dataset was then modelled using logistic regression, decision tree and Naïve Bayes. The predicted results indicated that logistic regression on the SMOTE-balanced dataset gave the most accurate classification, with 94.5% overall accuracy, 93.9% true positive rate, 5.8% false positive rate and 5.1% false negative rate. Therefore, SMOTE and other over-sampling approaches may be applicable to overcome imbalanced class issues in medical datasets.
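The core idea of SMOTE can be sketched in a few lines: each synthetic sample is placed at a random position on the segment between a minority sample and one of its k nearest minority-class neighbours. The sketch below is a simplified illustration using only the Python standard library, not the implementation used in the study; the function name `smote`, its parameters, and the toy data points are assumptions made for the example.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest neighbours within the minority class."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Rank the remaining minority samples by Euclidean distance to base.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (q - b) for b, q in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: three positive (minority) and ten negative cases.
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2)]
majority = [(float(i), float(i % 3)) for i in range(10)]

# Over-sample the minority class until both classes have the same size.
synthetic = smote(minority, len(majority) - len(minority), k=2)
balanced_minority = minority + synthetic
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class rather than being exact duplicates. A smaller `k` keeps synthetic points closer to their base sample.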