Highlights
- SpliceAI, a 32-layer deep neural network, predicts splicing from a pre-mRNA sequence
- 75% of predicted cryptic splice variants validate on RNA-seq
- Cryptic splicing may yield 10% of pathogenic variants in neurodevelopmental disorders
- Cryptic splice variants frequently give rise to alternative splicing

A deep neural network precisely models mRNA splicing from a genomic sequence and accurately predicts noncoding cryptic splice mutations in patients with rare genetic diseases.

SUMMARY
The splicing of pre-mRNAs into mature transcripts is remarkable for its precision, but the mechanisms by which the cellular machinery achieves such specificity are incompletely understood. Here, we describe a deep neural network that accurately predicts splice junctions from an arbitrary pre-mRNA transcript sequence, enabling precise prediction of noncoding genetic variants that cause cryptic splicing. Synonymous and intronic mutations with predicted splice-altering consequence validate at a high rate on RNA-seq and are strongly deleterious in the human population. De novo mutations with predicted splice-altering consequence are significantly enriched in patients with autism and intellectual disability compared to healthy controls and validate against RNA-seq in 21 out of 28 of these patients. We estimate that 9%-11% of pathogenic mutations in patients with rare genetic disorders are caused by this previously underappreciated class of disease variation.

Figure legend, panels F-H:
(F) Relationship between exon-intron length and the strength of the adjoining splice sites, as predicted by SpliceAI-80 nt (local motif score) and SpliceAI-10k. The genome-wide distributions of exon length (yellow) and intron length (pink) are shown in the background. The x axis is in log-scale.
(G) A pair of splice acceptor and donor motifs, placed 150 nt apart, are walked along the HMGCR gene. Shown are, at each position, K562 nucleosome signal and the likelihood of the pair forming an exon at that position, as predicted by SpliceAI-10k. The genome-wide Spearman correlation between the two tracks is shown.
(H) Average K562 and GM12878 nucleosome signal near private mutations that are predicted by the SpliceAI-10k model to create novel exons in the GTEx cohort.
Background
The investigation of the interconnections between the molecular and genetic events that govern biological systems is essential if we are to understand the development of disease and design effective novel treatments. Microarray and next-generation sequencing technologies have the potential to provide this information. However, taking full advantage of these approaches requires that biological connections be made across large quantities of highly heterogeneous genomic datasets. Leveraging the rapidly growing quantities of genomic data in the public domain is fast becoming one of the key challenges in the research community today.

Methodology/Results
We have developed a novel data mining framework that enables researchers to use this growing collection of public high-throughput data to investigate any set of genes or proteins. The connectivity between molecular states across thousands of heterogeneous datasets from microarrays and other genomic platforms is determined through a combination of rank-based enrichment statistics, meta-analyses, and biomedical ontologies. We address data quality concerns through dataset replication and meta-analysis and ensure that the majority of the findings are derived using multiple lines of evidence. As an example of our strategy and the utility of this framework, we apply our data mining approach to explore the biology of brown fat within the context of the thousands of publicly available gene expression datasets.

Conclusions
Our work presents a practical strategy for organizing, mining, and correlating global collections of large-scale genomic data to explore normal and disease biology. Using a hypothesis-free approach, we demonstrate how a data-driven analysis across very large collections of genomic data can reveal novel discoveries and evidence to support existing hypotheses.
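The abstract above mentions rank-based enrichment statistics without naming a specific one. As a minimal sketch only, assuming a Mann-Whitney-style rank-sum score (which may differ from what the authors actually used), the idea of asking whether a gene set sits near the top of a ranked dataset could look like this. The function name `rank_sum_enrichment` and the toy gene labels are hypothetical.

```python
from typing import Dict, Iterable


def rank_sum_enrichment(scores: Dict[str, float], gene_set: Iterable[str]) -> float:
    """Return a rank-based enrichment score in [0, 1].

    The score is the Mann-Whitney U statistic divided by its maximum, i.e. the
    probability that a randomly chosen gene-set member outranks a randomly
    chosen non-member. 0.5 means no enrichment; values near 1 mean the set
    is concentrated among the highest-scoring genes.
    """
    members = set(gene_set) & set(scores)
    n_in = len(members)
    n_out = len(scores) - n_in
    if n_in == 0 or n_out == 0:
        raise ValueError("need at least one gene inside and one outside the set")

    # Assign ascending ranks (1 = lowest score), averaging ranks across ties.
    ordered = sorted(scores, key=lambda gene: scores[gene])
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = average_rank
        i = j + 1

    rank_sum = sum(ranks[gene] for gene in members)
    u_statistic = rank_sum - n_in * (n_in + 1) / 2
    return u_statistic / (n_in * n_out)


# Toy values for illustration only; these are not real expression data.
toy_scores = {"gene_a": 5.0, "gene_b": 4.0, "gene_c": 3.0, "gene_d": 2.0, "gene_e": 1.0}
top_set_score = rank_sum_enrichment(toy_scores, ["gene_a", "gene_b"])
bottom_set_score = rank_sum_enrichment(toy_scores, ["gene_d", "gene_e"])
```

A rank-based score of this kind depends only on the ordering of genes within each dataset, which is one plausible reason such statistics suit comparisons across heterogeneous platforms.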
Gene expression studies employing high-throughput real-time PCR methods require finding uniform conditions for optimal amplification of multiple targets, often a daunting task. We developed a primer database, qPrimerDepot, which provides optimized primers for all human and mouse RefSeq genes. These primers are designed to amplify the desired templates under a unified annealing temperature. For most intron-bearing genes, primers flank one of the largest introns, thus minimizing background noise due to genomic DNA contamination. The qPrimerDepot database can be accessed at and .
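The abstract does not say how qPrimerDepot estimates melting temperature. Purely as an illustration of checking that a primer pair could share one annealing temperature, the sketch below uses the Wallace rule (Tm = 2(A+T) + 4(G+C), in degrees Celsius), a rough approximation suited to short oligonucleotides. The function names, the 2-degree tolerance, and the example sequences are assumptions, not part of the published tool.

```python
def wallace_tm(primer: str) -> int:
    """Estimate the melting temperature (degrees C) with the Wallace rule."""
    seq = primer.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("primer must be a non-empty string of A, C, G, T")
    at_count = seq.count("A") + seq.count("T")
    gc_count = seq.count("G") + seq.count("C")
    return 2 * at_count + 4 * gc_count


def shares_annealing_window(forward: str, reverse: str, max_delta: int = 2) -> bool:
    """Return True if the two primers' estimated Tm values differ by at most max_delta."""
    return abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_delta


# Hypothetical short sequences, chosen only to exercise the functions.
matched_pair = shares_annealing_window("ATGC", "ATGG")
mismatched_pair = shares_annealing_window("AAAA", "GGCC")
```

Production primer design generally relies on nearest-neighbor thermodynamic models rather than this rule, so the sketch shows only the shape of the check.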
During lens fiber cell differentiation, the regulation of crystallin gene expression is coupled with dramatic morphological changes. Here we report that Mafs, Prox1, and Pax6, which are essential transcription factors for normal lens development, bind to three functionally important cis elements, PL1, PL2, and OL2, in the chicken βB1-crystallin promoter and may cooperatively direct the transcription of this lens fiber cell-preferred gene. Gel shift assays demonstrated that Mafs bind to the MARE-like sequences in the PL1 and PL2 elements, whereas Prox1, a sequence-specific DNA-binding protein like its Drosophila homolog Prospero, interacts with the OL2 element. Furthermore, Pax6, a known repressor of the chicken βB1-crystallin promoter, binds to all three of these cis elements. In transfection assays, Mafs and Prox1 activated the chicken βB1-crystallin promoter; however, their transactivation ability was repressed when co-transfected with Pax6. Taken together with the known spatiotemporal expression patterns of Mafs, Prox1, and Pax6 in the developing lens, we propose that Pax6 occupies and represses the chicken βB1-crystallin promoter in lens epithelial cells, and is displaced by Prox1 and Mafs, which activate the promoter, in differentiating cortical fiber cells.
Profiling the dynamic interaction of p300 with proximal promoters of human T cells identified a class of genes that rapidly coassemble p300 and RNA polymerase II (pol II) following mitogen stimulation. Several of these p300 targets are immediate early genes, including FOS, implicating a prominent role for p300 in the control of primary genetic responses. The recruitment of p300 and pol II rapidly transitions to the assembly of several elongation factors, including the positive transcriptional elongation factor (P-TEFb), the bromodomain-containing protein (BRD4), and the elongin-like eleven nineteen lysine-rich leukemia protein (ELL). However, transcription at many of these rapidly induced genes is transient, wherein swift departure of P-TEFb, BRD4, and ELL coincides with termination of transcriptional elongation. Unexpectedly, both p300 and pol II remain accumulated or "bookmarked" at the proximal promoter long after transcription has terminated, demarking a clear mechanistic separation between the recruitment and elongation phases of transcription in vivo. The bookmarked pol II is depleted of both serine-2 and serine-5 phosphorylation of its C-terminal domain and remains proximally positioned at the promoter for hours. Surprisingly, these p300/pol II bookmarked genes can be readily reactivated, and elongation factors can be reassembled by subsequent addition of nonmitogenic agents that, alone, have minimal effects on transcription in the absence of prior preconditioning by mitogen stimulation. These findings suggest that p300 is likely to play an important role in biological processes in which transcriptional bookmarking or preconditioning influences cellular growth and development through the dynamic priming of genes for response to rechallenge by secondary stimuli.

gene regulation | histone acetylation | transcription | ELL | epigenetics