RNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. A powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (Outlier in RNA-Seq Finder), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read-count expectations according to the gene covariation resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best recall of artificially corrupted data. Precision-recall analyses using simulated outlier read counts demonstrated the importance of controlling for covariation and significance-based thresholds. OUTRIDER is open source and includes functions for filtering out genes not expressed in a dataset, for identifying outlier samples with too many aberrantly expressed genes, and for detecting aberrant gene expression on the basis of false-discovery-rate-adjusted p values. Overall, OUTRIDER provides an end-to-end solution for identifying aberrantly expressed genes and is suitable for use by rare-disease diagnostic platforms.
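The significance step described above (negative binomial counts with a gene-specific dispersion, followed by false-discovery-rate adjustment) can be illustrated with a minimal sketch. This is not the OUTRIDER implementation. The function names, the mean/size parameterization (`mu`, `theta`), the two-sided p-value convention, the choice of Benjamini-Hochberg as the adjustment, and the toy numbers are all assumptions made for illustration. The expected counts are taken as given here, whereas in the method described above they come from the autoencoder.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-probability of count k under a negative binomial with
    mean mu and dispersion (size) theta. Assumed parameterization."""
    return (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
            + theta * math.log(theta / (theta + mu))
            + k * math.log(mu / (theta + mu)))


def nb_two_sided_pvalue(k, mu, theta):
    """Two-sided p-value: twice the smaller tail, capped at 1.
    This is one common convention, assumed here for illustration."""
    pmf_k = math.exp(nb_log_pmf(k, mu, theta))
    lower = sum(math.exp(nb_log_pmf(i, mu, theta)) for i in range(k + 1))  # P(X <= k)
    # P(X >= k); clamped because the subtraction loses precision far in the upper tail
    upper = max(0.0, 1.0 - lower + pmf_k)
    return min(1.0, 2.0 * min(lower, upper))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example: one gene, four samples, hypothetical expected counts and dispersion
expected = [100.0, 100.0, 100.0, 100.0]
observed = [98, 105, 5, 110]
theta = 20.0

pvals = [nb_two_sided_pvalue(k, mu, theta) for k, mu in zip(observed, expected)]
padj = benjamini_hochberg(pvals)
for k, p, q in zip(observed, pvals, padj):
    print(f"count={k:4d}  p={p:.3g}  padj={q:.3g}")
```

In this sketch a sample would be flagged as an expression outlier for the gene when its adjusted p-value falls below a chosen cutoff. The cutoff itself is a user decision and is not specified in the text above.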
RNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. An obvious and powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or they rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (OUTlier in RNA-seq fInDER), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read count expectations according to the co-variation among genes resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best correction of artificially corrupted data. Precision-recall analyses using simulated outlier read counts demonstrated the importance of correction for co-variation and of significance-based thresholds. OUTRIDER is open source and includes functions for filtering out genes not expressed in a data set, for identifying outlier samples with too many aberrantly expressed genes, and for the P-value-based detection of aberrant gene expression, with false discovery rate adjustment. Overall, OUTRIDER provides a computationally fast and scalable end-to-end solution for identifying aberrantly expressed genes, suitable for use by rare disease diagnostics platforms.

Introduction

Many patients suspected to suffer a Mendelian disorder undergo whole exome or whole genome sequencing. Nonetheless, in most cases no clear pathogenic variant can be pinpointed [1,2]. A possible reason is that the pathogenic variant is regulatory. However, precisely identifying pathogenic regulatory variants is difficult for at least two reasons. First, an individual harbors a very large number of rare non-coding variants, with about 60,000 non-coding single nucleotide variants compared with 475 protein-affecting rare variants per genome (with MAF < 0.005) [3]; and second, our understanding of regulatory sequences is much poorer than our understanding of coding sequences, so the prioritization of regulatory sequences can be difficult.

Two recent studies have shown that using RNA sequencing (RNA-seq) to directly investigate gene expression defects in patients' cells provides a promising complementary method for pinpointing pathogenic regulatory defects [4,5]. RNA-seq can help to reveal splicing defects, the mono-allelic expression of heterozygous loss-of-function variants, and expression outliers (i.e. genes aberrantly expressed outside their physiological range) [4,5]. The two studies used different ...