Reliable identification of expressed somatic insertions/deletions (indels) is an unmet demand: artifacts generated in PCR-based RNA-Seq library preparation and the lack of normal RNA-Seq data present analytical challenges for the discovery of somatic indels in the tumor transcriptome. By implementing features characterized by PCR-free whole-genome and whole-exome sequencing into a machine-learning framework, we present RNAIndel, a tool for predicting somatic, germline and artifact indels from tumor RNA-Seq data alone. RNAIndel robustly predicts 87-93% of somatic indels from 235 samples with heterogeneous conditions, even recovering subclonal (VAF range 0.01-0.15) driver indels missed by targeted deep-sequencing, outperforming the current best practice for RNA-Seq variant calling, which had 57% sensitivity but 12 times more false positives. RNAIndel is freely available at https://github.com/stjude/RNAIndel
Contact: jinghui.zhang@stjude.org
Introduction

Transcriptome sequencing (RNA-Seq) is a versatile platform for performing a multitude of cancer genomic analyses such as gene expression profiling, allele-specific expression, alternative splicing and fusion transcript detection. However, variant identification in RNA-Seq is not a common practice due to the presence of artifacts introduced in library preparation, the intrinsic complexity of splicing, and RNA editing (Piskol et al., 2013). RNA-Seq data are predominantly generated from tumor-only samples, as a normal tissue with a comparable transcriptome is rarely acquired. This lack of matching normal data further complicates somatic variant discovery in RNA-Seq. Despite these challenges, there are compelling reasons to explore RNA-Seq data for variant detection: 1) RNA variants are expressed, and are therefore more readily interpretable in terms of cancer phenotype and clinical actionability than DNA variants; and 2) some studies analyze tumor specimens only by RNA-Seq, and employing variant detection will make full use of the available data resources. Thus, successful development of RNA-Seq variant calling tools will make this platform an interpretable and cost-effective alternative to DNA-based whole-genome or whole-exome sequencing (DNA-Seq), the current standard platform for somatic variant detection. Various single nucleotide variant (SNV) detection tools dedicated to RNA-Seq have been developed. SNPiR (Piskol et al., 2013) calls true RNA-Seq SNVs by hard-filtering calls in repetitive and low-quality regions, around splice sites, and at known RNA-editing sites. RVboost (Wang et al., 2014) is a machine-learning method that prioritizes true SNVs and is trained on common SNPs in the input RNA-Seq data. eSNV-Detect (Tang et al., 2016) incorporates results generated from two mappers to confidently call expressed SNVs by removing mapping artifacts from individual mappers.
Opossum (Oikkonen and Lise, 2017) preprocesses RNA-Seq reads for SNV calling by splitting intron-spanning reads and removing spurious reads. By contrast, indel detection in the transcriptome is more challenging and has been ...