Single-molecule sequencing technologies have the potential to improve the measurement and analysis of long RNA molecules expressed in cells. However, the analysis of error-prone long RNA reads remains a challenge. We present AERON, a method for estimating transcript expression and predicting gene-fusion events. AERON uses an efficient read-to-graph alignment algorithm to obtain accurate estimates from noisy reads. We demonstrate that AERON yields accurate expression estimates on simulated and real datasets. It is the first method to reliably call gene-fusion events from long RNA reads. We sequenced the K562 transcriptome and, using AERON, found known as well as novel gene-fusion events.
Introduction

Whole-transcriptome sequencing has become an important method in many research projects. Due to the revolution in short-read sequencing technologies and algorithm development, the detection of expressed RNAs in biological samples with thousands of cells [1] and even in single cells [2] is now routine. There are many important applications of such technologies, for example the detection of disease-specific gene expression patterns [3] or the detection of gene fusion events in cancer cells [4], which open the door to new therapeutic options and novel biological discoveries. However, short-read whole-transcriptome sequencing has its weaknesses. RNA molecules can be thousands of nucleotides long, and recent studies have revealed that more than 95% of multi-exonic genes undergo alternative splicing [5,6]. Short-read sequencing thus has important limitations when it comes to the accurate quantification of gene isoform expression levels and the detection of gene fusion events [7]. Recent long-read sequencing technologies, such as those developed by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have made significant progress in sequencing output per run at dramatically reduced costs. Thus, an important current challenge is to develop methods that can use long-read RNA sequencing data for tasks where long reads outperform short reads, such as transcript quantification and gene fusion detection. Given their ability to cover large fractions of each transcript, and frequently complete transcripts, longer reads hold the promise of more accurate expression estimates. Although there is much potential in using long RNA reads for transcript quantification, limited work has been done in this area and, to our knowledge, there are presently no tools for detecting gene-fusion events from long RNA reads.
Modern bioinformatics tools designed for short read RNA sequencing such as Cufflinks [8], Kallisto [9] and Salmon [10] map short reads to a reference transcriptome and estimate abundances. Similarly, for fusion detection, algorithms such as TopHat-Fusion [11], SOAPfuse [12], MapSplice [13] and others align reads to a reference using a "splice-aware" aligner. They detect fusion events by considering reads overlapping two different genes [14]. All the above methods ...