BackgroundMicroarray technology can acquire information about thousands of genes simultaneously. We analyzed published breast cancer microarray databases to predict five-year recurrence and compared the performance of three data mining algorithms of artificial neural networks (ANN), decision trees (DT) and logistic regression (LR) and two composite models of DT-ANN and DT-LR. The collection of microarray datasets from the Gene Expression Omnibus, four breast cancer datasets were pooled for predicting five-year breast cancer relapse. After data compilation, 757 subjects, 5 clinical variables and 13,452 genetic variables were aggregated. The bootstrap method, Mann–Whitney U test and 20-fold cross-validation were performed to investigate candidate genes with 100 most-significant p-values. The predictive powers of DT, LR and ANN models were assessed using accuracy and the area under ROC curve. The associated genes were evaluated using Cox regression.ResultsThe DT models exhibited the lowest predictive power and the poorest extrapolation when applied to the test samples. The ANN models displayed the best predictive power and showed the best extrapolation. The 21 most-associated genes, as determined by integration of each model, were analyzed using Cox regression with a 3.53-fold (95% CI: 2.24-5.58) increased risk of breast cancer five-year recurrence…ConclusionsThe 21 selected genes can predict breast cancer recurrence. Among these genes, CCNB1, PLK1 and TOP2A are in the cell cycle G2/M DNA damage checkpoint pathway. Oncologists can offer the genetic information for patients when understanding the gene expression profiles on breast cancer recurrence.
Peripheral blood mononuclear cell (PBMC)-derived gene signatures were investigated for their potential use in the early detection of non-small cell lung cancer (NSCLC). In our study, 187 patients with NSCLC and 310 age- and gender-matched controls, and an independent set containing 29 patients for validation were included. Eight significant NSCLC-associated genes were identified, including DUSP6, EIF2S3, GRB2, MDM2, NF1, POLDIP2, RNF4, and WEE1. The logistic model containing these significant markers was able to distinguish subjects with NSCLC from controls with an excellent performance, 80.7% sensitivity, 90.6% specificity, and an area under the receiver operating characteristic curve (AUC) of 0.924. Repeated random sub-sampling for 100 times was used to validate the performance of classification training models with an average AUC of 0.92. Additional cross-validation using the independent set resulted in the sensitivity 75.86%. Furthermore, six age/gender-dependent genes: CPEB4, EIF2S3, GRB2, MCM4, RNF4, and STAT2 were identified using age and gender stratification approach. STAT2 and WEE1 were explored as stage-dependent using stage-stratified subpopulation. We conclude that these logistic models using different signatures for total and stratified samples are potential complementary tools for assessing the risk of NSCLC.
Molecular signatures in PB can stratify the prognosis in patients with advanced NSCLC. Further prospective, interventional clinical trials should be performed to test if gene profiling also predicts resistance to chemotherapy.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.