BACKGROUND
Supervised machine learning (ML) has made its way into the healthcare literature, with results frequently reported using metrics such as accuracy, sensitivity, specificity, recall, or F1 score. While each provides a different perspective on performance, all are aggregate measures computed over the whole sample, discounting the uniqueness of each case or patient. Intuitively, we know that not all cases are equal, but current evaluative approaches do not take case difficulty into account.
OBJECTIVE
A more comprehensive, case-based approach to assessing supervised ML outcomes is warranted and forms the rationale for this study. We demonstrate how Item Response Theory (IRT) can be used to stratify the data by how ‘difficult’ each case is to classify, independent of the outcome measure of interest (e.g., accuracy). This stratification allows the evaluation of ML classifiers to take the form of a distribution rather than a single scalar value.
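The abstract does not specify the particular IRT model or software used. As a purely illustrative sketch, the Python code below estimates case difficulty under a Rasch (1PL) model from a binary response matrix (rows as respondents, columns as cases treated as items) by joint maximum likelihood via gradient ascent, then assigns cases to difficulty strata. The function name rasch_difficulty, the learning-rate and iteration settings, and the tertile cut points are assumptions for illustration, not the authors' specification.

    import numpy as np

    def rasch_difficulty(correct, n_iter=200, lr=0.05):
        # Joint maximum-likelihood estimation of a Rasch (1PL) model by
        # gradient ascent: correct[i, j] = 1 if respondent i answered
        # (classified) item/case j correctly. Returns per-case difficulty.
        n_resp, n_items = correct.shape
        theta = np.zeros(n_resp)   # respondent ability
        beta = np.zeros(n_items)   # case (item) difficulty
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
            resid = correct - p                # gradient of the log-likelihood
            theta += lr * resid.sum(axis=1)
            beta -= lr * resid.sum(axis=0)
            beta -= beta.mean()                # anchor the scale (identifiability)
        return beta

    # Toy response matrix: 6 respondents x 8 cases (1 = correct).
    rng = np.random.default_rng(0)
    responses = rng.integers(0, 2, size=(6, 8)).astype(float)
    difficulty = rasch_difficulty(responses)
    # Stratify cases into tertiles of estimated difficulty (0 = easiest).
    strata = np.digitize(difficulty, np.quantile(difficulty, [1 / 3, 2 / 3]))
    print(difficulty.round(2), strata)

The resulting strata are what allow classifier performance to be reported as a distribution over difficulty levels rather than a single overall value.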
METHODS
Two large, public intensive care unit (ICU) data sets, MIMIC-III and eICU, were used to showcase this method for predicting mortality. For each data set, a balanced and an imbalanced sample were drawn. Conventional ML classification metrics are reported for methodological comparison. Several ML algorithms were used in the demonstration: logistic regression (LR), linear discriminant analysis (LDA), K-nearest neighbors (KNN), decision tree (DT), naïve Bayes (NB), and a neural network (NN). Generalized linear mixed model analyses assessed the effects of case difficulty stratum, ML algorithm, and their interaction on accuracy.
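As a minimal sketch of the classifier comparison described above, the following Python code trains the six algorithms with scikit-learn, reports conventional metrics, and records the per-case correctness matrix that an IRT analysis could stratify. The feature matrix X, labels y, the train/test split, and all hyperparameters are assumptions for illustration, since the abstract does not give these details; data extraction and preprocessing of the ICU records are omitted.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

    CLASSIFIERS = {
        "LR": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "KNN": KNeighborsClassifier(),
        "DT": DecisionTreeClassifier(random_state=0),
        "NB": GaussianNB(),
        "NN": MLPClassifier(max_iter=500, random_state=0),
    }

    def evaluate(X, y):
        # Train each classifier, print conventional metrics, and return the
        # per-case correctness matrix (classifiers x test cases) that a
        # subsequent IRT analysis could use for difficulty stratification.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=0)
        correctness = []
        for name, clf in CLASSIFIERS.items():
            model = make_pipeline(StandardScaler(), clf).fit(X_tr, y_tr)
            y_hat = model.predict(X_te)
            tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
            print(f"{name}: accuracy={accuracy_score(y_te, y_hat):.3f} "
                  f"sensitivity={tp / (tp + fn):.3f} "
                  f"specificity={tn / (tn + fp):.3f} "
                  f"F1={f1_score(y_te, y_hat):.3f}")
            correctness.append((y_hat == y_te).astype(int))
        return np.vstack(correctness)

In the study itself, per-case accuracy would then be modeled with a generalized linear mixed model including difficulty stratum, algorithm, and their interaction; that analysis is not reproduced in this sketch.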
RESULTS
The results illustrated that all classifiers performed better on easier-to-classify cases and that, overall, the NN performed best. Significant interactions suggest that cases falling in the most difficult strata should be handled by LR, LDA, DT, or NN, but not NB or KNN. This demonstration shows that IRT is a viable method for understanding the data provided to ML algorithms, independent of outcome measures, and highlights how well classifiers differentiate between cases of varying difficulty.
CONCLUSIONS
This method generates an explanation of which features are indicative of healthy states and why. It enables end users to tailor the classifier to the difficulty level of the patient, supporting a personalized medicine approach.
CLINICALTRIAL
N/A