Abstract. Colon cancer is the third most common cancer and one of the leading causes of cancer-related death in the world. Identifying biomarkers that capture the biological characteristics of the disease is therefore a key problem for the early diagnosis of colon cancer. In this study, we used a random forest approach to discover biomarkers from a set of oligonucleotide microarray data of colon cancer. Real-time PCR was used to validate the expression levels of the biomarkers selected by our approach. Furthermore, ROC curves were used to analyze the sensitivity and specificity of each biomarker in both the training and test sample sets. Finally, we analyzed the clinical significance of each biomarker based on its differential expression. A single classifier consisting of 4 genes (IL8, WDR77, MYL9 and VIP) was selected by random forests, with an average sensitivity and specificity of 83.75% and 76.15%, respectively. The differential expression of each biomarker was validated by real-time PCR in 48 test colon cancer samples compared with the matched normal tissues. Patients with high expression of IL8 and WDR77 and low expression of MYL9 and VIP had a significantly reduced median survival compared with other colon cancer patients. The results indicate that our approach can be employed for biomarker identification based on microarray data. The 4 genes identified by our approach have the potential to act as clinical biomarkers for the early diagnosis of colon cancer.
Introduction. Colon cancer is the third most common cancer and one of the leading causes of morbidity and mortality in the world (1). According to United States statistics released in 2010, the incidence rate of colon cancer has decreased (2). Over the last decade, many studies have proposed statistical methods to analyze gene expression patterns and to identify new biomarkers that provide prognostic and/or predictive information on human diseases (3,4). However, most of the early studies applied unsupervised approaches to data mining and to the identification of differential gene expression profiles of certain diseases, such as hierarchical clustering for class discovery, which takes an unbiased approach to searching for subgroups in the data (5). As statistical methods spread through the field of biomedicine, many supervised clustering analyses and machine learning approaches were adopted to handle gene expression profiling data and to select feature genes that carry more information for classifying different diseases or subclasses of the same disease.
Various statistical and machine learning methods, including clustering (6,7), Bayesian algorithms (8), and support vector machines (9), have been proposed to analyze microarray data generated through high-throughput experiments. Over the last few years, multiclassifier fusion has developed substantially and has proven very successful in improving the accuracy of certain classifiers. Random forests (RF) (10,11), a tree-based method of classificat...