Genomic identification of driver mutations and genes in cancer cells is critical for precision medicine. Because the distribution of background mutations is difficult to model, existing statistical methods are often underpowered to discriminate driver genes from passenger genes. Here we propose a novel statistical approach, weighted iterative zero-truncated negative-binomial regression (WITER), to detect cancer-driver genes showing an excess of somatic mutations. By addressing the inaccurate modeling of background mutations, this approach works even in small or moderate samples. Compared to alternative methods, it detected more significant and cancer-consensus genes in all tested cancers. Applying this approach, we estimated 178 driver genes in 26 different cancer types. In silico validation confirmed 90.5% of predicted genes as likely known drivers and 7 genes unique to individual cancers as likely new drivers. The technical advances of WITER enable the detection of driver genes in TCGA datasets as small as 30 subjects, rescuing more genes missed by alternative tools.
Running title: powerful cancer-driver gene detection method
In the present study, we built a random forest model, trained on a large cancer somatic mutation database, COSMIC (V83), to predict high-frequency cancer driver potential for use as prior weights (see details in the supplementary notes). Other methods can also be used to produce the prior weights.
Tier III: A scheme for integrating independent reference samples to stabilize the regression model for small samples
When the sample size is small, it is difficult to build a stable regression model. Note that the key idea of ITER and WITER is to build a prediction model for background passenger genes. When the mutation rates of passenger genes in two cancers are similar, it may be workable to integrate the background genes of one cancer into another cancer. We propose a reference sample strategy to construct a stable ITER or WITER model in small samples. This is carried out in two stages.
i. The above ITER or WITER is used to produce p-values for the excess of somatic mutations at genes in a reference sample that has enough variants. Genes with p-values below a very loose cutoff, say FDR 0.3, are then excluded.
ii. The somatic mutations of the retained genes are integrated with the local small sample and input into ITER or WITER to build a new model. The excess of somatic mutations and the corresponding p-values at genes are calculated based on the new model.
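The two stages above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that the reference-sample p-values have already been produced by ITER or WITER, that the "FDR 0.3" cutoff is applied to Benjamini-Hochberg adjusted p-values, and that "integration" means appending the retained reference genes as extra rows alongside the local genes. The function names and the toy gene labels and counts are hypothetical.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(pvalues: Dict[str, float]) -> Dict[str, float]:
    """Return Benjamini-Hochberg adjusted p-values (q-values) keyed by gene."""
    ranked = sorted(pvalues.items(), key=lambda kv: kv[1])
    m = len(ranked)
    qvalues: Dict[str, float] = {}
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        gene, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        qvalues[gene] = running_min
    return qvalues


def retained_background_genes(ref_pvalues: Dict[str, float],
                              fdr_cutoff: float = 0.3) -> List[str]:
    """Stage i: exclude reference genes whose q-value falls below the loose cutoff."""
    q = benjamini_hochberg(ref_pvalues)
    return sorted(gene for gene, value in q.items() if value >= fdr_cutoff)


def integrate_reference(local_counts: Dict[str, int],
                        ref_counts: Dict[str, int],
                        retained: List[str]) -> List[Tuple[str, str, int]]:
    """Stage ii (assumed form): stack retained reference genes under the local genes.

    Each row is (source, gene, mutation_count); the combined rows would then be
    passed to the regression step, which is outside the scope of this sketch.
    """
    rows = [("local", gene, count) for gene, count in sorted(local_counts.items())]
    rows += [("reference", gene, ref_counts[gene]) for gene in retained]
    return rows


# Toy inputs (hypothetical values, for illustration only).
ref_pvalues = {"G1": 0.0001, "G2": 0.001, "G3": 0.5, "G4": 0.8, "G5": 0.9}
ref_counts = {"G1": 120, "G2": 80, "G3": 3, "G4": 2, "G5": 5}
local_counts = {"G1": 4, "G6": 1}

retained = retained_background_genes(ref_pvalues, fdr_cutoff=0.3)
combined = integrate_reference(local_counts, ref_counts, retained)
```

In this toy input, the two genes with very small p-values are dropped from the reference background, and the remaining three are stacked with the local genes before refitting.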
Curation of cancer-specific predictors of somatic mutations
We collected 4 types of cancer-specific predictors for somatic mutations: copy number variation (CNV), gene expression, DNA methylation, and chromatin accessibility by ATAC-Seq. All data were produced from TCGA cohorts, and the preprocessed data were downloaded from https://xenabrowser.net. The pan-cancer CNVs mapped onto genes were downloaded from https://tcga.xenahubs.net/download/TCGA.PANCAN.sampleMap/Gistic2_CopyNumber_Gistic2_all_thresholded.by_genes.gz. We then wrote a progr...