2018
DOI: 10.1101/416685
Preprint

Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks

Abstract: Algorithms that accurately predict gene structure from primary sequence alone were transformative for annotating the human genome. Can we also predict the expression levels of genes based solely on genome sequence? Here we sought to apply deep convolutional neural networks towards this goal. Surprisingly, a model that includes only promoter sequences and features associated with mRNA stability explains 59% and 71% of variation in steady-state mRNA levels in human and mouse, respectively. This model, which we c…

Cited by 62 publications (146 citation statements)
References 64 publications

“…The more LoF-intolerant a gene is, the more broadly it tends to be expressed across tissues, and at higher levels (Karczewski et al, 2019). Even though it is well established that promoter CpG density is associated with these two properties as well (Saxonov et al, 2006; Agarwal, Shendure, 2018; Hartl et al, 2019), we found that neither variable explains our result (Figure 2, Supplemental Figure S7). First, after stratifying genes according to either expression level or tissue specificity (using RNA-seq data from the GTEx consortium; Methods), we saw a clear relationship between promoter CpG density and LOEUF within each stratum (Figure 2a, b).…”
Section: The Association Between CpG Density and LoF-Intolerance Is No…
Citation type: contrasting
confidence: 83%
“…The following regression algorithms were used: linear regression, ridge regression, lasso, elastic net, random forest, support vector machines with nested cross-validation, and k-nearest neighbour regression [81]. To include information from the regulatory DNA sequences in the shallow models, k-mers of lengths 4 to 6 bp were extracted from the regulatory DNA sequences [82] … Table S1-6), which included inception layers [84], (ii) 1 to 2 bidirectional recurrent neural network (RNN) layers [85], and (iii) 1 to 2 fully connected (FC) layers, in a global architecture layout CNN-RNN-FC [30, 86-88]. Training the networks both (i) concurrently or (ii) consecutively, by weight transfer on different variables (regulatory sequences to CNN and RNN, numeric variables to FC), showed that the architecture yielding best results was a concurrently trained CNN (3 layers)-FC (2 layers) [12, 89-91], which was used for all models.…”
Section: Modeling and Statistical Analysis
Citation type: mentioning
confidence: 99%
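
The k-mer step mentioned in the statement above (k-mers of lengths 4 to 6 bp taken from regulatory sequences as inputs to shallow regression models) can be illustrated with a minimal Python sketch. This is not the citing authors' code: the function names `kmer_counts` and `kmer_vector`, the overlapping sliding-window counting, and the choice to skip windows containing non-ACGT characters are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(seq, k_min=4, k_max=6):
    """Count overlapping k-mers for every k in [k_min, k_max].

    Hypothetical helper: windows containing characters other than
    A, C, G, T (for example N) are skipped, which is an assumption.
    """
    seq = seq.upper()
    counts = Counter()
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set(ALPHABET):
                counts[kmer] += 1
    return counts


def kmer_vector(seq, k=4):
    """Return a fixed-length count vector over all 4**k possible k-mers.

    The vector is ordered lexicographically (AAAA, AAAC, ...), so it can
    be used directly as a feature row for a regression model.
    """
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = kmer_counts(seq, k, k)
    return [counts[kmer] for kmer in vocab]
```

A fixed vocabulary order keeps the feature columns aligned across genes, which is what linear models such as ridge or lasso would need; whether the cited study normalized or filtered these counts is not stated in the excerpt.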
“…This limits the experimental studies to individual regulatory gene parts in the context of single reporter genes. Similarly, with natural systems, the majority of studies on mRNA transcription in the context of transcription factor (TF) binding [25], chromatin accessibility [26, 27] and ChIP-seq or DNase-seq data [28, 29], focus solely on promoter regions [30]. Therefore, both the current natural and synthetic approaches are fundamentally limited in their ability to study the relationship between the different parts of the gene regulatory structure and their cooperative regulation of expression.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Recently, three deep neural network models have been developed to predict gene expression levels from DNA sequences [5,6,7]. The ExPecto framework employs a convolutional neural network, consisting of 7 convolution layers, 2 linear layers, and other layers such as pooling and Table 1.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…The Basenji architecture [6] consists of 12 convolution layers and other layers such as pooling and ReLU layers, and it predicts the mRNA level of a gene directly from a DNA sequence 131 kbp long. The Xpresso model [7] is composed of two convolution layers, two fully connected layers, and other layers such as pooling layers, and it predicts the expression level of a gene from a DNA sequence of 10.5 kbp (7 kbp upstream and 3.5 kbp downstream of the TSS) and several mRNA features.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
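
The input window described in the statement above (10.5 kbp: 7 kbp upstream and 3.5 kbp downstream of the TSS) can be sketched as follows. This is a rough illustration under stated assumptions, not Xpresso's actual preprocessing: the names `promoter_window` and `one_hot`, the 0-based TSS coordinate, the padding with `N` at chromosome ends, and the all-zero encoding of `N` are hypothetical choices.

```python
# Complement table for reverse-complementing minus-strand windows.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

_ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def promoter_window(chrom_seq, tss, strand, upstream=7000, downstream=3500):
    """Return the strand-aware window around a TSS.

    Assumptions (not from the source): `tss` is a 0-based index into
    `chrom_seq`, and positions beyond the sequence ends are padded with N.
    In the returned string the TSS always sits at index `upstream`.
    """
    if not 0 <= tss < len(chrom_seq):
        raise ValueError("tss must lie inside the sequence")
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, upstream lies at higher coordinates.
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(chrom_seq))
    core = chrom_seq[max(0, start):min(len(chrom_seq), end)].upper()
    window = left_pad + core + right_pad
    if strand == "-":
        window = window.translate(_COMPLEMENT)[::-1]
    return window


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).

    Any other character, including N, becomes an all-zero row.
    """
    return [_ONE_HOT.get(base, [0, 0, 0, 0]) for base in seq.upper()]
```

With the default arguments the window is 10,500 bases long, matching the length quoted above; a convolutional model would typically consume the one-hot matrix, while the mRNA features mentioned in the statement would be supplied separately.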