Genome-wide association studies (GWASs) for complex traits have implicated thousands of genetic loci. Most GWAS-nominated variants lie in noncoding regions, complicating the systematic translation of these findings into functional understanding. Here, we leverage convolutional neural networks to assist in this challenge. Our computational framework, peaBrain, models the transcriptional machinery of a tissue as a two-stage process: first, predicting the mean tissue-specific abundance of all genes and ...

Most reported disease-associated variation for complex traits lies in non-coding regions of the genome [1]. Despite advances in discovery and annotations of functional non-coding elements across the genome [2-5], characterising the consequences of non-coding variants remains a major challenge in human genetics. Prediction of the transcriptomic consequences of non-coding variation represents one solution [6][7][8][9][10]. Current methods of variant-expression prediction can be broadly divided into two classes: (a) methods that predict alterations in epigenetic and transcription factor binding sites (TFBS), such as DeepSEA [8] and Basset [10]; and (b) methods that directly predict RNA abundance from genotype or sequence data, such as PrediXcan [6] and TWAS [9]. Methods in the former category do not capture differences in transcript expression as a result of genotypic variation [8,10] and are relatively poor predictors of alterations in the histone code [8]; methods in the latter category are not able to identify which of the variants detected within an eQTL association locus are functional [6,9].
To address these concerns, we introduce a single framework, the promoter-and-enhancer-derived abundance (peaBrain) model, which consolidates both of these approaches. Within the ... class-C models, we added additional channels corresponding, for those tissues where such data were available, to the consolidated epigenomes from the Epigenomics Roadmap, including tissue-specific peaks from H3K4me1, H3K4me3, H3K9ac, H3K9me3, H3K27me3, and H3K36me3 ChIP-seq experiments, and experimentally derived DNase hotspots [17].
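As an illustration of what "additional channels" can mean for a convolutional model, the sketch below stacks a one-hot DNA encoding with one binary track per annotation. It is a minimal sketch under stated assumptions, not the peaBrain implementation: the function names (`one_hot_dna`, `peak_channel`, `encode_window`), the binary peak encoding, the half-open peak coordinates, and the channel order are all hypothetical choices made for this example.

```python
import numpy as np

BASES = "ACGT"
# Marks named in the text, plus DNase hotspots as a seventh track.
# The ordering here is an assumption made for this example.
MARKS = ["H3K4me1", "H3K4me3", "H3K9ac", "H3K9me3",
         "H3K27me3", "H3K36me3", "DNase"]


def one_hot_dna(seq):
    """Return a (len(seq), 4) array; unknown bases such as N stay all-zero."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out


def peak_channel(length, peaks):
    """Return a (length,) binary track from half-open (start, end) offsets."""
    track = np.zeros(length, dtype=np.float32)
    for start, end in peaks:
        # Clip both ends into [0, length] so out-of-window peaks mark nothing.
        lo = min(max(start, 0), length)
        hi = min(max(end, 0), length)
        track[lo:hi] = 1.0
    return track


def encode_window(seq, peaks_by_mark):
    """Stack DNA and annotation channels into (len(seq), 4 + len(MARKS))."""
    dna = one_hot_dna(seq)
    tracks = [peak_channel(len(seq), peaks_by_mark.get(mark, []))
              for mark in MARKS]
    return np.concatenate([dna, np.stack(tracks, axis=1)], axis=1)
```

For a tissue with no data for a given mark, the corresponding track in this sketch is simply all zeros; whether the original models handle missing annotations this way is not stated in the text.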
We observed that DNA-only (class-A) models captured nearly a fifth of the variance in mean gene abundance across all GTEx tissues (10-fold cross-validated median out-of-sample r² [oos-r²] across all tissues = 17%). Addition of non-specific regulatory annotations (class-B models) markedly improved model performance across all tissues (median cross-validated oos-r² = 45%; Figure 1). (We average the oos-r² across all 10 folds within a tissue and use the median across all tissues to assess global performance; see Online Methods.) For example, for EBV-transformed lymphocytes, the 10-fold cross-validated average oos-r² is 56% for the class-B model, compared with 15% for the corresponding class-A model. Addition of tissue-specific annotations further improved model performance, such that class-C models captured more than half t...