Summary

In this paper we investigate the use of the receiver operating characteristic (ROC) curve for the evaluation of machine learning algorithms. In particular, we investigate the use of the area under the ROC curve (AUC) as a measure of classifier performance. The machine learning algorithms used are chosen to be representative of those in common use: two decision trees (C4.5 and Multiscale Classifier), two neural networks (Perceptron and Multi-layer Perceptron), and two statistical methods (K-Nearest Neighbours and a Quadratic Discriminant Function).

The evaluation is done using six "real world" medical diagnostics data sets that contain varying numbers of inputs and samples, but are primarily continuous input, binary classification problems. We identify three forms of bias that can affect comparisons of this type (estimation, selection, and expert bias) and detail the methods used to avoid them. We compare and discuss the use of AUC with the conventional measure of classifier performance, overall accuracy (the probability of a correct response). It is found that AUC exhibits a number of desirable properties when compared to overall accuracy: increased sensitivity in Analysis of Variance (ANOVA) tests; a standard error that decreased as both AUC and the number of test samples increased; it is decision threshold independent; it is invariant to a priori class probabilities; and it gives an indication of the amount of "work done" by a classification scheme, giving low scores to both random and "one class only" classifiers.

It has been known for some time that AUC actually represents the probability that a randomly chosen positive example is correctly rated (ranked) with greater suspicion than a randomly chosen negative example. Moreover, this probability of correct ranking is the same quantity estimated by the non-parametric Wilcoxon statistic.
We use this equivalence to show that the standard deviation of AUC, estimated using 10-fold cross validation, is a reliable estimator of the standard error estimated using the Wilcoxon test. The paper concludes with the recommendation that AUC be used in preference to overall accuracy when "single number" evaluation of machine learning algorithms is required.
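
The equivalence between AUC and the Wilcoxon statistic can be illustrated with a minimal Python sketch. It computes AUC in two ways: as the fraction of positive-negative pairs that are ranked correctly, and from the rank sum of the positive examples. The function names and the toy scores are hypothetical. The standard error function uses the Hanley and McNeil approximation, which is an assumption here, since the summary does not state which formula is used.

```python
import math


def auc_pairwise(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; a tie counts 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_rank_sum(pos_scores, neg_scores):
    """Same quantity from the Wilcoxon rank sum, using midranks for tied scores."""
    labelled = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    rank_sum_pos = 0.0
    i = 0
    while i < len(labelled):
        # Find the block [i, j) of examples that share the same score.
        j = i
        while j < len(labelled) and labelled[j][0] == labelled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += midrank * sum(label for _, label in labelled[i:j])
        i = j
    # Subtract the smallest possible rank sum for the positives.
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def auc_standard_error(auc, n_pos, n_neg):
    """Approximate standard error of AUC (Hanley and McNeil form; an assumption)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc * auc)
           + (n_neg - 1) * (q2 - auc * auc)) / (n_pos * n_neg)
    return math.sqrt(max(var, 0.0))  # guard against tiny negative rounding error


# Hypothetical classifier outputs: higher means "more suspicious".
pos = [0.9, 0.8, 0.7, 0.55]
neg = [0.6, 0.4, 0.3, 0.2]
print(auc_pairwise(pos, neg), auc_rank_sum(pos, neg))
print(auc_standard_error(auc_pairwise(pos, neg), len(pos), len(neg)))
```

On these toy scores, 15 of the 16 positive-negative pairs are ranked correctly, so both functions return 0.9375. A classifier that assigns the same score to every example ties every pair and scores 0.5 under both definitions, which matches the low score the summary describes for "one class only" classifiers.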
Abstract

In this paper we investigate the use of the area under the receiver operating characteristic (ROC) curve (AUC) as a performance measure for machine learning algorithms. As a case study we evaluate six machine learning algorithms (C4.5, Multiscale Classifier, Perceptron, Multi-layer Perceptron, K-Nearest Neighbours, and a Quadratic Discriminant Function) on six "real world" medical diagnostics data sets. We compare and discuss the use of AUC to the more conventional overall accuracy and find that AUC exhibits a number of desirable properties when compared to overall accuracy: increased sensitivity in Analysis of Variance (ANOVA) tests; a standard error that decreased as both AUC and the number of test samples increased; it is decision threshold independent; and it is invariant to a priori class proba...