Mammalian promoters can be separated into two classes, conserved TATA box-enriched promoters, which initiate at a well-defined site, and more plastic, broad and evolvable CpG-rich promoters. We have sequenced tags corresponding to several hundred thousand transcription start sites (TSSs) in the mouse and human genomes, allowing precise analysis of the sequence architecture and evolution of distinct promoter classes. Different tissues and families of genes differentially use distinct types of promoters. Our tagging methods allow quantitative analysis of promoter usage in different tissues and show that differentially regulated alternative TSSs are a common feature in protein-coding genes and commonly generate alternative N termini. Among the TSSs, we identified new start sites associated with the majority of exons and with 3' UTRs. These data permit genome-scale identification of tissue-specific promoters and analysis of the cis-acting elements associated with them.
This study describes comprehensive polling of transcription start and termination sites and analysis of previously unidentified full-length complementary DNAs derived from the mouse genome. We identify the 5' and 3' boundaries of 181,047 transcripts with extensive variation in transcripts arising from alternative promoter usage, splicing, and polyadenylation. There are 16,247 new mouse protein-coding transcripts, including 5154 encoding previously unidentified proteins. Genomic mapping of the transcriptome reveals transcriptional forests, with overlapping transcription on both strands, separated by deserts in which few transcripts are observed. The data provide a comprehensive platform for the comparative analysis of mammalian transcriptional regulation in differentiation and development.
We introduce cap analysis gene expression (CAGE), which is based on the preparation and sequencing of concatamers of DNA tags derived from the initial 20 nucleotides at the 5' ends of mRNAs. CAGE allows high-throughput gene expression analysis and the profiling of transcriptional start points (TSPs), including promoter usage analysis. By analyzing four libraries (brain, cortex, hippocampus, and cerebellum), we redefined more accurately the TSPs of 11-27% of the analyzed transcriptional units that were hit. The frequency of CAGE tags correlates well with results from other analyses, such as serial analysis of gene expression, and furthermore maps the TSPs more accurately, including in tissue-specific cases. The high-throughput nature of this technology paves the way for understanding gene networks via correlation of promoter usage and gene transcriptional factor expression.
Keywords: full-length cDNA | transcriptome | sequencing | cap-trapping
Even the comparison of mammalian genome draft sequences (1) has left many unanswered questions with regard to the exact identification of expressed genes, their promoter elements, and the network of promoter/transcriptional factor usage that underlies gene expression. Partial identification of the promoter sites has been provided by gene discovery programs based on the sequencing of full-length cDNA libraries (2-4); these have been instrumental in identifying the sequence of promoter regions, including potentially different promoters (5). Several thousand promoters can be determined by sequencing 5' ends from full-length cDNA libraries and mapping the sequences to the genome, thus determining which correspond to coding and regulatory regions, respectively. These analyses can produce statistics on transcriptional start sites derived from large numbers of 5' end sequences.
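As a rough illustration of how start-site statistics can be derived from mapped 5' end sequences, the sketch below counts tags per genomic position and reports the most frequently used position in a window. It is a minimal sketch under stated assumptions, not the pipeline used in the work described here: the input format (a list of `(chrom, strand, position)` tuples), the tags-per-million normalization, and the function names `tss_profile` and `dominant_tss` are all hypothetical choices made for this example.

```python
from collections import Counter

def tss_profile(tags):
    """Count mapped 5' ends per site and scale to tags per million.

    `tags` is an iterable of (chrom, strand, five_prime_pos) tuples,
    one per mapped tag (an assumed input format for this sketch).
    """
    counts = Counter(tags)
    total = sum(counts.values())
    return {site: n * 1_000_000 / total for site, n in counts.items()}

def dominant_tss(profile, chrom, strand, start, end):
    """Return the site with the highest signal inside a window, or None."""
    window = {
        site: tpm
        for site, tpm in profile.items()
        if site[0] == chrom and site[1] == strand and start <= site[2] <= end
    }
    if not window:
        return None
    return max(window, key=window.get)

# Toy data: three tags start at chr1:100, one at chr1:102, one on chr2.
tags = [("chr1", "+", 100)] * 3 + [("chr1", "+", 102), ("chr2", "-", 500)]
profile = tss_profile(tags)
best = dominant_tss(profile, "chr1", "+", 90, 110)
```

Because each tag contributes one count, the per-site frequencies can be compared across libraries once they are normalized to the library size, which is the property that makes tag-based data digital.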
However, these methods lack the throughput to provide significantly abundant data for intermediately/lowly expressed genes, chiefly because the comprehensive sequencing of cDNA libraries is prohibitively expensive. On the other hand, microarrays for high-throughput tissue expression analysis do exist (6), but these cannot determine transcription starting points and therefore cannot be used to accurately identify the cis regulatory elements that will be essential for computing gene networks. Another limitation of microarrays is that the only genes/transcripts that can be studied are those that have already been identified by the sequencing, which is far from completion (2). Serial analysis of gene expression (SAGE) allows partial sequence information of short tags at the 3' ends of mRNAs (7) to be obtained. Although the information is partial, it is amenable to relatively cheap high-throughput digital data collection, because it is based on the cloning and subsequent sequencing of concatamers of short DNA fragments derived from 3' ends of multiple mRNAs (http://cgap.nci.nih.gov/SAGE). This method was further improved on by Long-SAGE, which allows for the cloning of 20-nt SAGE tags (8), which mainly identify single loci on the ge...