Linking distal enhancers to genes and modeling their impact on target gene expression are longstanding unresolved problems in regulatory genomics and critical for interpreting non-coding genetic variation. Here we present a new deep learning approach called GraphReg that exploits 3D interactions from chromosome conformation capture assays in order to predict gene expression from 1D epigenomic data or genomic DNA sequence. By using graph attention networks to exploit the connectivity of distal elements and promoters, GraphReg more faithfully models gene regulation and more accurately predicts gene expression levels than dilated convolutional neural networks (CNNs), the current state-of-the-art deep learning approach for this task. Feature attribution used with GraphReg accurately identifies functional enhancers of genes, as validated by CRISPRi-FlowFISH and TAP-seq assays, outperforming both CNNs and the recently proposed Activity-by-Contact model. GraphReg therefore represents an important advance in modeling the regulatory impact of epigenomic and sequence elements.
Decoding transcription factor (TF) binding signals in genomic DNA is a fundamental problem. Here we present a prediction model called BindSpace that learns to embed DNA sequences and TF class/family labels into the same space. By training on binding data for hundreds of TFs and embedding over 1M DNA sequences, BindSpace achieves state-of-the-art multiclass binding prediction performance, in vitro and in vivo, and can distinguish signals of closely related TFs.

Direct measurement of genome-wide transcription factor (TF) occupancy for all expressed factors in a cell type of interest is practically infeasible outside of large consortium projects. Therefore, computational prediction of TF binding to cognate sites at relevant loci (e.g. chromatin-accessible regions or putative enhancers defined by active histone marks) is of critical importance. Massive efforts to define the intrinsic binding affinities of TFs by protein binding microarrays (PBMs) [1], cognate site identification (CSI) [2], genomic-context PBM (gc-PBM) [3], mechanically induced trapping of molecular interactions (MITOMI) [4] and high-throughput SELEX followed by sequencing (HT-SELEX) [5] provide large-scale data sets for training binding models. However, these in vitro binding experiments are typically summarized as position-specific weight matrix (PWM) motifs, losing both specificity and sensitivity and leading to near-identical motifs for closely related TFs. Supervised learning methods have improved accuracy of discrimination between bound and unbound sequences of individual TFs [6-9] but have not addressed the multiclass nature of the problem and therefore are not optimized to distinguish between TFs with similar binding signals.

Here we present a novel multiclass and multilabel method to jointly learn binding preferences of hundreds of assayed TFs by embedding their bound/unbound DNA sequences and class labels into a common space.
Our method, called BindSpace, learns accurate binding models for individual TFs while enabling discrimination between different TFs in the same family. To train BindSpace, we combined HT-SELEX in vitro binding experiments for 461 mouse and human TFs from previous large-scale studies [5,8]. After applying rigorous quality control (Methods), we used 270 experiments for 243 transcription factors for our training set. The top 2000 enriched probes from each of these experiments were used as positive examples, yielding over 500K positive training sequences. We randomly sampled universal negatives from initial HT-SELEX probe libraries as well as non-accessible genomic regions to obtain ~500K negative training sequences (Methods). Each sequence is represented as a bag of 8-mers, each containing up to two consecutive wild cards, and each bag is associated either with both a TF label (e.g. HOXA2) and a TF family label (e.g. Homeodomain) or with a universal negative label. In this study, we used two-thirds of the HT-SELEX data for training and one-third for testing, and we performed 5-fold cross-validation on the training data...
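The sequence representation described above (a bag of 8-mers with up to two consecutive wild cards) can be illustrated with a minimal Python sketch. This is not the BindSpace implementation. The function names `gapped_kmers` and `kmer_bag`, the wildcard symbol `N`, and the choice to allow wild cards at any position of the 8-mer (including its ends) are assumptions made for illustration. Reverse-complement handling, which the excerpt does not describe, is omitted.

```python
from collections import Counter

WILDCARD = "N"  # assumption: symbol standing in for a wild-card position


def gapped_kmers(window, max_gap=2):
    """Return the window itself plus every variant in which one run of
    1..max_gap consecutive positions is replaced by wild cards."""
    variants = [window]
    for gap in range(1, max_gap + 1):
        for start in range(len(window) - gap + 1):
            variants.append(
                window[:start] + WILDCARD * gap + window[start + gap:]
            )
    return variants


def kmer_bag(sequence, k=8, max_gap=2):
    """Count exact and gapped k-mers over all sliding windows of a sequence.
    Sequences shorter than k yield an empty bag."""
    bag = Counter()
    for i in range(len(sequence) - k + 1):
        bag.update(gapped_kmers(sequence[i:i + k], max_gap))
    return bag


if __name__ == "__main__":
    probe = "ACGTACGTAC"  # toy probe, not real HT-SELEX data
    for kmer, count in sorted(kmer_bag(probe).items()):
        print(kmer, count)
```

Under these assumptions each 8-bp window contributes 16 features: the exact 8-mer, 8 variants with a single wild card, and 7 variants with two consecutive wild cards. In a setup like the one described, each such bag would then be paired with its TF and TF family labels, or with the universal negative label.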
We investigated tumor-cell-intrinsic chromatin accessibility patterns of pancreatic ductal adenocarcinoma (PDAC) by ATAC-seq on EpCAM+ PDAC malignant epithelial cells, sorted from 54 freshly resected human tumors, and discovered a signature of 1092 chromatin loci displaying differential accessibility between patients with disease-free survival (DFS) < 1 year and patients with DFS > 1 year. Analyzing transcription factor (TF) binding motifs within these loci, we identified two TFs (ZKSCAN1 and HNF1b) displaying differential nuclear localization between patients with short vs. long DFS. We further developed a novel chromatin accessibility microarray methodology termed ATAC-Array, an easy-to-use platform obviating the time and cost of next-generation sequencing. Applying this methodology to the original ATAC-seq libraries as well as to independent libraries generated from patient-derived organoids, we validated ATAC-Array technology in both the original ATAC-seq cohort and an independent validation cohort. We conclude that PDAC prognosis can be predicted by ATAC-Array, which represents a novel, low-cost, clinically feasible technology for assessing chromatin accessibility profiles.