Most discoveries of cancer biomarkers involve construction of a single model to predict survival. 'Data-mining' techniques, such as artificial neural networks (ANNs), perform better than traditional methods, such as logistic regression. In this study, the quality of multiple predictive models built on a molecular data set for colorectal cancer (CRC) was evaluated. Predictive models (logistic regressions, ANNs, and decision trees) were compared, and the effect of variable-selection techniques on the predictive quality of these models was investigated. The Kolmogorov-Smirnov (KS) statistic was used to compare the models. Overall, the logistic regression and ANN methods outperformed the decision tree. In some instances (e.g., for a model that included 'all variables without tumor stage' and used a decision tree for variable selection), the ANN marginally outperformed logistic regression, although the difference in the KS statistic was minimal (0.80 versus 0.82). Regardless of the variable(s) and the methods for variable selection, all three predictive models identified survivors and non-survivors with the same level of statistical accuracy.
Keywords: Artificial neural networks; Colorectal cancer; Decision trees; Kolmogorov-Smirnov statistic; Logistic regression; Predictive models
Send correspondence to: Upender Manne, Department of Pathology, University of Alabama at Birmingham, 515B1-Kracke Building 619, 19th Street South, Birmingham, AL, manne@uab.edu.
INTRODUCTION
Researchers are now examining methods for predicting survival or disease recurrence in cancer patients by use of data-mining techniques, such as artificial neural networks (ANNs), decision trees, and k-nearest neighbor (k-NN). In particular, these techniques are being used to predict the clinical outcome for patients with colorectal cancer (CRC) (1-4).
... of the TNM components of tumor staging variables by themselves in an ANN significantly increased the predictive accuracy of the variables when compared to a model for survival analysis with the same variables. The predictive accuracy of the variables used in the models was measured by the area under the ROC curve. The ANN increased the predictive accuracy of the model by 44-74%. ANNs were also used by this group to build predictive models for breast cancer survival; they found that, compared to TNM staging, the ANN provided better predictive accuracy (5, 6). Therefore, it is not clear whether the improved predictive capacity was a reflection of the ANN method, or whether variables such as positive lymph nodes and p53 status contributed to improving the predictability of a model. Furthermore, the predictive accuracy of a model depends on the variables in the model, and the technique for variable selection determines the quality of a model (7).
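Two discrimination measures appear in this paper: the KS statistic, used in this study to compare models, and the area under the ROC curve, used in the earlier ANN work described above. As a minimal illustrative sketch (not the authors' implementation), both can be computed from a model's predicted scores and the observed binary outcome. The function names, the toy scores, and the coding of survivors as 1 and non-survivors as 0 are assumptions made only for this example.

```python
from bisect import bisect_right


def split_by_class(scores, labels):
    """Return sorted scores for the positive (1) and negative (0) class."""
    pos = sorted(s for s, y in zip(scores, labels) if y == 1)
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    return pos, neg


def ks_statistic(scores, labels):
    """Two-sample KS statistic: the largest gap between the empirical
    cumulative distributions of scores in the two outcome classes."""
    pos, neg = split_by_class(scores, labels)
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = bisect_right(pos, t) / len(pos)  # share of positives <= t
        cdf_neg = bisect_right(neg, t) / len(neg)  # share of negatives <= t
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as half."""
    pos, neg = split_by_class(scores, labels)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted survival probabilities and observed outcomes
# (1 = survivor, 0 = non-survivor); not data from the study.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]

print(ks_statistic(toy_scores, toy_labels))  # 0.75
print(roc_auc(toy_scores, toy_labels))       # 0.9375
```

A KS statistic of 0 means the two score distributions coincide, and 1 means the model separates the classes completely, so values such as 0.80 and 0.82 indicate nearly equivalent separation on this scale.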
NIH Public Access Author Manuscript. Front Biosci (Elite Ed).
ANNs and other data-mining tools are called 'black-box' techniques, since the logic used to determine the final model is not transparent....