RNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. An obvious and powerful approach is to identify aberrant gene expression levels as potential pathogenic events.
However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or they rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (OUTlier in RNA-seq fInDER), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read count expectations according to the co-variation among genes resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best correction of artificially corrupted data. Precision-recall analyses using simulated outlier read counts demonstrated the importance of correction for co-variation and of significance-based thresholds.
OUTRIDER is open source and includes functions for filtering out genes not expressed in a data set, for identifying outlier samples with too many aberrantly expressed genes, and for the P-value-based detection of aberrant gene expression, with false discovery rate adjustment.
Overall, OUTRIDER provides a computationally fast and scalable end-to-end solution for identifying aberrantly expressed genes, suitable for use by rare disease diagnostics platforms.
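To make the statistical step concrete, the following is a minimal sketch of how a read count could be tested against a negative binomial expectation and then adjusted for multiple testing. It is an illustration under stated assumptions, not the OUTRIDER implementation: the expected counts are taken as given (standing in for the autoencoder output), the two-sided P-value construction and the Benjamini-Hochberg procedure are choices made here for familiarity, and all function names and numbers are hypothetical.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-probability of count k under a negative binomial with
    mean mu (> 0) and dispersion (size) parameter theta (> 0)."""
    return (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


def nb_two_sided_pvalue(k, mu, theta):
    """Assumed two-sided construction: 2 * min(0.5, P(X <= k), P(X >= k)).
    Extreme upper tails are limited by floating-point precision."""
    pmf_k = math.exp(nb_log_pmf(k, mu, theta))
    lower = min(1.0, sum(math.exp(nb_log_pmf(i, mu, theta)) for i in range(k + 1)))
    upper = min(1.0, 1.0 - lower + pmf_k)  # P(X >= k) = 1 - P(X <= k) + P(X = k)
    return 2.0 * min(0.5, lower, upper)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values (an assumed choice of FDR procedure)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def call_outliers(observed, expected, theta, alpha=0.05):
    """Flag counts whose FDR-adjusted P-value is at most alpha."""
    pvals = [nb_two_sided_pvalue(k, mu, theta) for k, mu in zip(observed, expected)]
    padj = benjamini_hochberg(pvals)
    return [p <= alpha for p in padj], pvals, padj


# Hypothetical counts for one gene across five samples; the expected
# counts would come from the fitted model rather than being constant.
observed = [100, 95, 110, 3, 102]
expected = [100.0] * 5
flags, pvals, padj = call_outliers(observed, expected, theta=50.0)
# Only the fourth sample (count 3 against an expectation of 100) is flagged.
```

In this toy example the dispersion is fixed by hand; in the approach described above it is gene-specific and fitted from the data.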
Introduction
Many patients suspected to suffer from a Mendelian disorder undergo whole exome or whole genome sequencing. Nonetheless, in most cases no clear pathogenic variant can be pinpointed 1,2. A possible reason is that the pathogenic variant is regulatory. However, precisely identifying pathogenic regulatory variants is difficult for at least two reasons. First, an individual harbors a very large number of rare non-coding variants, with about 60,000 non-coding single nucleotide variants compared with 475 protein-affecting rare variants per genome (with MAF < 0.005) 3; and second, our understanding of regulatory sequences is much poorer than our understanding of coding sequences, so the prioritization of regulatory sequences can be difficult.
Two recent studies have shown that using RNA sequencing (RNA-seq) to directly investigate gene expression defects in patients' cells provides a promising complementary method for pinpointing pathogenic regulatory defects 4,5. RNA-seq can help to reveal splicing defects, the mono-allelic expression of heterozygous loss-of-function variants, and expression outliers (i.e. genes aberrantly expressed outside their physiological range) 4,5. The two studies used different 5...