Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks

Agarwal, Vikram; Shendure, Jay

doi:10.1101/416685

Cited by 62 publications

(146 citation statements)

References 64 publications

Supporting

Mentioning: 141

Contrasting

“…The more LoF-intolerant a gene is, the more broadly it tends to be expressed across tissues, and at higher levels (Karczewski et al, 2019). Even though it is well established that promoter CpG density is associated with these two properties as well (Saxonov et al, 2006; Agarwal, Shendure, 2018; Hartl et al, 2019), we found that neither variable explains our result (Figure 2, Supplemental Figure S7). First, after stratifying genes according to either expression level or tissue specificity (using RNA-seq data from the GTEx consortium; Methods), we saw a clear relationship between promoter CpG density and LOEUF within each stratum (Figure 2a, b).…”

Section: The Association Between CpG Density and LoF-intolerance Is No… (contrasting)

confidence: 83%

Promoter CpG density predicts downstream gene loss-of-function intolerance

Boukas, Björnsson, Hansen (2020). Preprint.

The aggregation and joint analysis of large numbers of exome sequences has recently made it possible to derive estimates of intolerance to loss-of-function (LoF) variation for human genes. Here, we demonstrate strong and widespread coupling between genic LoF-intolerance and promoter CpG density across the human genome. Genes downstream of the most CpG-rich promoters (top 10% CpG density) have a 67.2% probability of being highly LoF-intolerant, using the LOEUF metric from gnomAD. This is in contrast to 7.4% of genes downstream of the most CpG-poor (bottom 10% CpG density) promoters. Combining promoter CpG density with exonic and promoter conservation explains 33.4% of the variation in LOEUF, and the contribution of CpG density exceeds the individual contributions of exonic and promoter conservation. We leverage this to train a simple and easily interpretable predictive model that outperforms other existing predictors and allows us to classify 1,760 genes, which currently lack reliable LOEUF estimates, as highly LoF-intolerant or not. These predictions have the potential to aid in the interpretation of novel patient variants. Moreover, our results reveal that high CpG density is not merely a generic feature of human promoters, but is preferentially encountered at the promoters of the most selectively constrained genes, calling into question the prevailing view that CpG islands are not subject to selection.
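
To make the quantity in the abstract above concrete, here is a minimal sketch of one common way to quantify CpG density in a promoter sequence: the observed-to-expected CpG ratio. The abstract does not say which definition the authors used, so the formula, the function name `cpg_obs_exp`, and the toy sequences are all assumptions for illustration only.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed-to-expected CpG ratio of a DNA sequence.

    Assumed definition (not taken from the abstract):
        (number of CG dinucleotides * sequence length) / (number of C * number of G)
    Returns 0.0 when the sequence contains no C or no G.
    """
    s = seq.upper()
    n_c = s.count("C")
    n_g = s.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count finds every occurrence.
    n_cg = s.count("CG")
    return n_cg * len(s) / (n_c * n_g)


# Hypothetical toy sequences, far shorter than a real promoter.
cpg_rich = cpg_obs_exp("CGCGCGCG")
cpg_poor = cpg_obs_exp("ATATAT")
```

In practice the ratio would be computed over a fixed window around each gene's transcription start site; the window size is a choice the abstract does not specify.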

“…The following regression algorithms were used: linear regression, ridge regression, lasso, elastic net, random forest, support vector machines with nested cross-validation, and k-nearest neighbour regression 81. To include information from the regulatory DNA sequences in the shallow models, k-mers of lengths 4 to 6 bp were extracted from the regulatory DNA sequences 82 […] Table S1-6), which included inception layers 84 (ii) 1 to 2 bidirectional recurrent neural network (RNN) layers 85, and (iii) 1 to 2 fully connected (FC) layers, in a global architecture layout CNN-RNN-FC 30,86-88. Training the networks both (i) concurrently or (ii) consecutively, by weight transfer on different variables (regulatory sequences to CNN and RNN, numeric variables to FC), showed that the architecture yielding best results was a concurrently trained CNN (3 layers)-FC (2 layers) 12,89-91, which was used for all models.…”

Section: Modeling and Statistical Analysis (mentioning)

confidence: 99%
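
The statement above mentions extracting k-mers of lengths 4 to 6 bp from regulatory sequences as features for the shallow models. A minimal sketch of such a feature extraction might look like the following. The function name `kmer_counts`, the choice to skip k-mers containing ambiguous bases, and the use of raw counts (rather than frequencies) are assumptions, not details taken from the cited paper.

```python
from collections import Counter


def kmer_counts(seq: str, k_min: int = 4, k_max: int = 6) -> Counter:
    """Count all overlapping k-mers with k_min <= k <= k_max.

    K-mers containing characters other than A, C, G, T are skipped
    (an assumption made for this sketch).
    """
    s = seq.upper()
    valid = set("ACGT")
    counts = Counter()
    for k in range(k_min, k_max + 1):
        # Slide a window of width k across the sequence, one base at a time.
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= valid:
                counts[kmer] += 1
    return counts


# Hypothetical toy sequence.
features = kmer_counts("ACGTACGT")
```

The resulting counts could be arranged into a fixed-length vector (one entry per possible k-mer) before being passed to a regression model.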

“…This limits the experimental studies to individual regulatory gene parts in the context of single reporter genes. Similarly, with natural systems, the majority of studies on mRNA transcription in the context of transcription factor (TF) binding 25, chromatin accessibility 26,27 and ChIP-seq or DNase-seq data 28,29, focus solely on promoter regions 30. Therefore, both the current natural and synthetic approaches are fundamentally limited in their ability to study the relationship between the different parts of the gene regulatory structure and their cooperative regulation of expression.…”

Section: Introduction (mentioning)

confidence: 99%

Gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure

Zrimec, Buric, Muhammad, et al. (2019). Preprint.

Understanding the genetic regulatory code that governs gene expression is a primary, yet challenging aspiration in molecular biology that opens up possibilities to cure human diseases and solve biotechnology problems. However, the fundamental question of how each of the individual coding and non-coding regions of the gene regulatory structure interact and contribute to the mRNA expression levels remains unanswered. Considering that all the information for gene expression regulation is already present in living cells, here we applied deep learning on over 20,000 mRNA datasets to learn the genetic regulatory code controlling mRNA expression in 7 model organisms ranging from bacteria to human. We show that in all organisms, mRNA abundance can be predicted directly from the DNA sequence with high accuracy, demonstrating that up to 82% of the variation of gene expression levels is encoded in the gene regulatory structure. Coding and non-coding regions carry both overlapping and orthogonal information and additively contribute to gene expression levels. By searching for DNA regulatory motifs present across the whole gene regulatory structure, we discover that motif interactions can regulate gene expression levels in a range of over three orders of magnitude. The uncovered co-evolution of coding and non-coding regions challenges the current paradigm that single motifs or regions are solely responsible for gene expression levels. Instead, we propose a holistic system that spans all regions of the gene structure and is required to analyse, understand, and design any future gene expression systems.

“…Recently, three deep neural network models have been developed to predict gene expression levels from DNA sequences [5,6,7]. The ExPecto framework employs a convolutional neural network, consisting of 7 convolution layers, 2 linear layers, and other layers such as pooling and Table 1.…”

Section: Introduction (mentioning)

confidence: 99%

“…The Basenji architecture [6] consists of 12 convolution layers and other layers such as pooling and ReLU layers, and it predicts the mRNA level of a gene directly from a DNA sequence 131 kbps long. The Xpresso model [7] is composed of two convolution layers, two fully connected layers, and other layers such as pooling layers, and it predicts the expression level of a gene from a DNA sequence of 10.5 kbps (7 kbps upstream and 3.5 kbps downstream of the TSS) and several mRNA features.…”

Section: Introduction (mentioning)

confidence: 99%
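
The statement above describes the Xpresso input as a 10.5 kbp window spanning 7 kbp upstream and 3.5 kbp downstream of the TSS. A minimal sketch of how such an input could be prepared for a convolutional network is shown below: slice the window, then one-hot encode it. The function names `tss_window` and `one_hot`, the plus-strand-only handling, the 0-based coordinate convention, and the all-zero encoding of ambiguous bases are assumptions for illustration and are not taken from the Xpresso code.

```python
UPSTREAM = 7000     # bases upstream of the TSS, per the statement above
DOWNSTREAM = 3500   # bases downstream of the TSS, per the statement above
BASES = "ACGT"      # channel order is an assumption


def tss_window(chrom_seq: str, tss: int) -> str:
    """Return the 10.5 kbp window around a 0-based TSS on the plus strand."""
    start, end = tss - UPSTREAM, tss + DOWNSTREAM
    if start < 0 or end > len(chrom_seq):
        raise ValueError("window extends past the end of the sequence")
    return chrom_seq[start:end]


def one_hot(seq: str) -> list:
    """Encode a sequence as rows of four 0/1 values in A, C, G, T order.

    Any other character (for example N) becomes an all-zero row.
    """
    index = {base: i for i, base in enumerate(BASES)}
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        channel = index.get(base)
        if channel is not None:
            row[channel] = 1
        rows.append(row)
    return rows


# Hypothetical toy "chromosome": 7,000 A's, then 3,500 C's, then padding.
toy_chrom = "A" * 7000 + "C" * 3500 + "G" * 100
encoded = one_hot(tss_window(toy_chrom, tss=7000))
```

A minus-strand gene would additionally need the window reverse-complemented so that "upstream" keeps its biological meaning; that step is omitted here for brevity.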

Predicting Gene Expression from DNA Sequence using Residual Neural Network

Zhang, Zhou, Cai (2020). Preprint.

It is known that cis-acting DNA motifs play an important role in regulating gene expression. The genome in a cell thus contains the information that not only encodes for the synthesis of proteins but also is necessary for regulating expression of genes. Therefore, the mRNA level of a gene may be predictable from the DNA sequence. Indeed, three deep neural network models were developed recently to predict the mRNA level of a gene directly or indirectly from the DNA sequence around the transcription start site of the gene. In this work, we develop a deep residual network model, named ExpResNet, to predict gene expression directly from DNA sequence. Applying ExpResNet to the GTEx data, we demonstrate that ExpResNet outperforms the three existing models across four tissues tested. Our model may be useful in the investigation of gene regulation.
