Background
Gene fusions play a key role as driver oncogenes in tumors, and their reliable discovery and detection are important for cancer research, diagnostics, prognostics, and guiding personalized therapy. While discovering gene fusions from genome sequencing can be laborious and costly, the resulting 'fusion transcripts' can be recovered from RNA-seq data of tumor and normal samples. However, putative fusion transcripts can arise from multiple sources other than the chromosomal rearrangements that yield fusion genes, including cis- or trans-splicing events, experimental artifacts during RNA-seq, and computational errors of transcriptome reconstruction methods. Understanding how to discern, interpret, categorize, and verify predicted fusion transcripts is essential before they are considered in clinical settings or prioritized for further research. Here, we present FusionInspector for in silico characterization and interpretation of candidate fusion transcripts from RNA-seq, enabling exploration of the sequence and expression characteristics of fusions and their partner genes.
Results
We applied FusionInspector to thousands of tumor and normal transcriptomes, and identified statistical and experimental features enriched among biologically impactful fusions. Through clustering and machine learning, we identified large collections of fusions potentially relevant to tumor and normal biological processes. We show that biologically relevant fusions are enriched for relatively high expression of the fusion transcript, imbalanced fusion allelic ratios, and canonical splicing patterns, and are deficient in sequence microhomologies detected between partner genes.
Conclusion
We demonstrate that FusionInspector accurately validates fusion transcripts in silico and helps identify numerous understudied fusions in tumor and normal tissue samples. FusionInspector is freely available as open source software for screening, characterization, and visualization of candidate fusions from RNA-seq data. We believe that this work will continue to advance the transparent explanation and interpretation of machine learning predictions and the tracing of those predictions to their experimental sources.