High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact in the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.
With increased usage of long-read sequencing technologies to perform transcriptome analyses, there becomes a greater need to evaluate different methodologies including library preparation, sequencing platform, and computational analysis tools. Here, we report the study design of a community effort called the Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium, whose goals are characterizing the strengths and remaining challenges in using long-read approaches to identify and quantify the transcriptomes of both model and non-model organisms. The LRGASP organizers have generated cDNA and direct RNA datasets in human, mouse, and manatee samples using different protocols followed by sequencing on Illumina, Pacific Biosciences, and Oxford Nanopore Technologies platforms. Participants will use the provided data to submit predictions for three challenges: transcript isoform detection with a high-quality genome, transcript isoform quantification, and de novo transcript isoform identification. Evaluators from different institutions will determine which pipelines have the highest accuracy for a variety of metrics using benchmarks that include spike-in synthetic transcripts, simulated data, and a set of undisclosed, manually curated transcripts by GENCODE. We also describe plans for experimental validation of predictions that are platform-specific and computational tool-specific. We believe that a community effort to evaluate long-read RNA-seq methods will help move the field toward a better consensus on the best approaches to use for transcriptome analyses.
et al. quality control in full-length transcriptome identification and quantification SQANTI: extensive characterization of long-read transcript sequences for
Alternative splicing contributes to organismal complexity. Comparing transcripts between and within species is an important first step toward understanding questions about how evolution of transcript structure changes between species and contributes to sub-functionalization. These questions are confounded with issues of data quality and availability. The recent explosion of affordable long read sequencing of mRNA has considerably widened the ability to study transcriptional variation in non-model species. In this work, we develop a computational framework that uses nucleotide resolution distance metrics to compare transcript models for structural phenotypes: total transcript length, intron retention, donor/acceptor site variation, alternative exon cassettes, alternative 5'/3' UTRs are each scored qualitatively and quantitatively in terms of number of nucleotides. For a single annotation file, all differences among transcripts within a gene are summarized and transcriptome-level complexity metrics: number of variable nucleotides, unique exons per gene, exons per transcript, and transcripts per gene are calculated. To compare two transcriptomes on the same co-ordinates, a weighted total distance between pairs of transcripts for the same gene is calculated. The weight function proposed has larger penalties for intron retention and exon skipping than alternative donor/acceptor sites. Minimum distances can be used to identify both transcript pairs and transcripts missing structural elements in either of the two annotations. This enables a broad range of functionality from comparing sister species to comparing different methods of building and summarizing transcriptomes. Importantly, the philosophy here is to output metrics, enabling others to explore the nucleotide-level distance metrics. Single transcriptome annotation summaries and pairwise comparisons are implemented in a new tool, TranD, distributed as a PyPi package and in the open-source web-based Galaxy (www.galaxyproject.org) platform.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.