SUMMARY Transcription factor (TF) DNA sequence preferences direct their regulatory activity, but are currently known for only ~1% of all eukaryotic TFs. Broadly sampling DNA-binding domain (DBD) types from multiple eukaryotic clades, we determined DNA sequence preferences for >1,000 TFs encompassing 54 different DBD classes from 131 diverse eukaryotes. We find that closely related DBDs almost always have very similar DNA sequence preferences, enabling inference of motifs for ~34% of the ~170,000 known or predicted eukaryotic TFs. Sequences matching both measured and inferred motifs are enriched in ChIP-seq peaks and upstream of transcription start sites in diverse eukaryotic lineages. SNPs defining expression quantitative trait loci in Arabidopsis promoters are also enriched for predicted TF binding sites. Importantly, our motif “library” (http://cisbp.ccbr.utoronto.ca) can be used to identify specific TFs whose binding may be altered by human disease risk alleles. These data present a powerful resource for mapping transcriptional networks across eukaryotes.
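As an illustration of how a motif library might be used to flag TFs whose binding could be altered by a risk allele, the sketch below scores a reference and an alternate sequence against a position weight matrix and reports the change in the best match. It is a generic log-odds scan, not the authors' pipeline. The function names, the pseudocount, the uniform background, and the toy four-base motif are all assumptions made for this example.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2 odds against a uniform background.

    The pseudocount and the uniform background are illustrative assumptions.
    """
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append(
            {b: math.log2(((column[b] + pseudocount) / total) / background)
             for b in BASES}
        )
    return matrix


def best_match(matrix, sequence):
    """Return (score, offset) of the best forward-strand window.

    Expects an upper-case A/C/G/T sequence at least as long as the motif.
    """
    width = len(matrix)
    best_score, best_offset = float("-inf"), -1
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score > best_score:
            best_score, best_offset = score, offset
    return best_score, best_offset


def allele_effect(matrix, ref_seq, alt_seq):
    """Best-match score of the alternate allele minus that of the reference."""
    return best_match(matrix, alt_seq)[0] - best_match(matrix, ref_seq)[0]


# Hypothetical motif that prefers TGAC (toy counts, not from any database).
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
pwm = log_odds_matrix(toy_counts)

# A single-base change (A -> T) inside the site weakens the best match.
print(best_match(pwm, "AATGACAA"))
print(allele_effect(pwm, "AATGACAA", "AATGTCAA"))
```

A negative value from `allele_effect` suggests the alternate allele weakens the predicted site, a positive value that it strengthens or creates one. A real analysis would also scan the reverse strand and compare scores against a motif-specific threshold.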
RNA-binding proteins are key regulators of gene expression, yet only a small fraction have been functionally characterized. Here we report a systematic analysis of the RNA motifs recognized by RNA-binding proteins, encompassing 205 distinct genes from 24 diverse eukaryotes. The sequence specificities of RNA-binding proteins display deep evolutionary conservation, and the recognition preferences for a large fraction of metazoan RNA-binding proteins can thus be inferred from their RNA-binding domain sequence. The motifs that we identify in vitro correlate well with in vivo RNA-binding data. Moreover, we can associate them with distinct functional roles in diverse types of post-transcriptional regulation, enabling new insights into the functions of RNA-binding proteins both in normal physiology and in human disease. These data provide an unprecedented overview of RNA-binding proteins and their targets, and constitute an invaluable resource for determining post-transcriptional regulatory mechanisms in eukaryotes.
Introduction: Advancing whole-genome precision medicine requires understanding how gene expression is altered by genetic variants, especially those that are outside of protein-coding regions. We developed a computational technique that scores how strongly genetic variants alter RNA splicing, a critical step in gene expression whose disruption contributes to many diseases, including cancers and neurological disorders. A genome-wide analysis reveals tens of thousands of variants that alter splicing and are enriched with a wide range of known diseases. Our results provide insight into the genetic basis of spinal muscular atrophy, hereditary nonpolyposis colorectal cancer and autism spectrum disorder.
Methods: We used machine learning to derive a computational model that takes as input DNA sequences and applies general rules to predict splicing in human tissues. Given a test variant, our model computes a score that predicts how much the variant disrupts splicing. The model was derived in such a way that it can be used to study diverse diseases and disorders, and to determine the consequences of common, rare, and even spontaneous variants.
Results: Our technique is able to accurately classify disease-causing variants and provides insights into the role of aberrant splicing in disease. We scored over 650,000 DNA variants and found that disease-causing variants have higher scores than common variants and even those associated with disease in genome-wide association studies. Our model predicts substantial and unexpected aberrant splicing due to variants within introns and exons, including those far from the splice site. For example, among intronic variants that are more than 30 nucleotides away from a splice site, known disease variants alter splicing nine times more often than common variants; among missense exonic disease variants, those that least impact protein function are over five times more likely to alter splicing than other variants. Autism has been associated with disrupted splicing in brain regions, so we used our method to score variants detected using whole genome sequencing data from individuals with and without autism. Genes with high scoring variants include many that have been previously linked with autism, as well as new genes with known neurodevelopmental phenotypes. Most of the high scoring variants are intronic and cannot be detected by exome analysis techniques. When we score clinical variants in spinal muscular atrophy and colorectal cancer genes, up to 94% of variants found to disrupt splicing using minigene reporters are correctly classified.
Discussion: In the context of precision medicine, causal support for variants that is independent of existing studies is greatly needed. Our computational model was trained to predict splicing from DNA sequence alone, without using disease annotations or population data. Consequently, its predictions are independent of and complementary to population data, genome-wide association studies (GWAS), expression-based quantitative trait loci (QTL), and functi...
Cancer stem cells are critical for cancer initiation, development, and treatment resistance. Our understanding of these processes, and how they relate to glioblastoma heterogeneity, is limited. To overcome these limitations, we performed single-cell RNA sequencing on 53,586 adult glioblastoma cells and 22,637 normal human fetal brain cells, and compared the lineage hierarchy of the developing human brain to the transcriptome of cancer cells. We find a conserved neural tri-lineage cancer hierarchy centered around glial progenitor-like cells. We also find that this progenitor population contains the majority of the cancer’s cycling cells and that, according to RNA velocity analysis, it is often the originator of the other cell types. Finally, we show that this hierarchical map can be used to identify therapeutic targets specific to progenitor cancer stem cells. Our analyses show that normal brain development reconciles glioblastoma development, suggest a possible origin for glioblastoma hierarchy, and help to identify cancer stem cell-specific targets.