Although Kraken's k-mer-based approach provides fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed five-fold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.

Assigning taxonomic labels to sequencing reads is an important part of many computational genomics pipelines for metagenomics projects. Recent years have seen several approaches to accomplish this task in a time-efficient manner [1-3]. Kraken [4] used a memory-intensive algorithm that associates short genomic substrings (k-mers) with lowest common ancestor (LCA) taxa. Kraken and related tools like KrakenUniq [5] have proven highly efficient and accurate in other tool comparisons [6,7]. But Kraken's high memory requirements force many researchers to either use a reduced-sensitivity MiniKraken database [8,9], or to build and use many indexes over subsets of the reference sequences [10,11]. Its memory requirements can easily exceed 100 GB [7], especially when the reference data includes large eukaryotic genomes [12,13]. Here we introduce Kraken 2, which provides a major reduction in memory usage as well as faster classification, a spaced-seed searching scheme, a translated search mode for matching in amino acid space, and continued compatibility with the Bracken [14] species-level quantification algorithm.

Kraken 2 addresses the issue of large memory requirements through two changes to Kraken 1's data structures and algorithms. While Kraken 1 used a sorted list of k-mer/LCA pairs indexed by minimizers [15], Kraken 2 introduces a probabilistic, compact hash table to map minimizers to LCAs. This table uses one-third of the memory of a standard hash table, at the cost of some specificity and accuracy. Additionally, Kraken 2 stores only minimizers (of length ℓ, ℓ ≤ k) from the reference sequence library in the data structure, whereas Kraken 1 stored all k-mers. Kraken 2's index for a reference database consisting of 9.1 Gbp of genomic sequence uses 10.6 gigabytes of memory at classification time. Kraken 1's index for the same reference set uses 72.4 gigabytes of memory for classification (Figure 1a, Supplementary Table S1). In general, a Kraken 2 database is about 15% as large as a Kraken 1 database over the same references (Supplementary Figure S1).

Kraken 2's approach is faster than Kraken 1's because only the distinct minimizers from the query (read) trigger accesses to the hash table. A similar minimizer-based approach has proven useful in accelerating read alignment [16]. Kraken 2 additionally provides a hash-based filtering approach that subsamples the set of minimizer/LCA pairs included in the table, allowing the user to specify a target hash table size; smaller hash tables yield lower memory footprint and higher classification throughput at the expens...
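The minimizer-to-LCA mapping described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Kraken 2's implementation: a plain Python dict stands in for the probabilistic compact hash table, minimizers are chosen by lexicographic order, and canonical k-mers, spaced seeds, and hash-based subsampling are omitted. The function names (minimizers, lca, build_index, lookup_read) and the parent-map taxonomy are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def minimizers(seq: str, k: int, l: int) -> List[str]:
    """Return, for each k-mer of seq, its lexicographically smallest l-mer.

    Assumption: lexicographic ordering is used here for simplicity.
    """
    out = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        out.append(min(kmer[j:j + l] for j in range(k - l + 1)))
    return out


def lca(a: int, b: int, parent: Dict[int, int]) -> int:
    """Lowest common ancestor of taxa a and b.

    parent maps each taxon ID to its parent; the root maps to itself.
    """
    ancestors = set()
    while True:
        ancestors.add(a)
        if parent[a] == a:
            break
        a = parent[a]
    while b not in ancestors:
        b = parent[b]
    return b


def build_index(genomes: Dict[int, str], parent: Dict[int, int],
                k: int, l: int) -> Dict[str, int]:
    """Map each reference minimizer to the LCA of all taxa containing it.

    A plain dict is a stand-in for the compact hash table (assumption).
    """
    index: Dict[str, int] = {}
    for taxon, seq in genomes.items():
        for m in minimizers(seq, k, l):
            index[m] = lca(index[m], taxon, parent) if m in index else taxon
    return index


def lookup_read(index: Dict[str, int], read: str,
                k: int, l: int) -> Tuple[Dict[int, int], int]:
    """Count per-taxon k-mer hits for a read and the number of table accesses.

    Each distinct minimizer in the read is looked up once and then cached,
    so repeated minimizers do not trigger further table accesses.
    """
    cache: Dict[str, int] = {}
    hits: Dict[int, int] = {}
    lookups = 0
    for m in minimizers(read, k, l):
        if m not in cache:
            lookups += 1
            cache[m] = index.get(m, 0)  # 0 means "no match" in this sketch
        taxon = cache[m]
        if taxon:
            hits[taxon] = hits.get(taxon, 0) + 1
    return hits, lookups
```

In this sketch a read with four k-mers that share two distinct minimizers needs only two table accesses, which is the effect the text credits for the speedup. Because only ℓ-mers that are minimizers are stored, the index holds far fewer keys than one entry per k-mer would require.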
As sequencing throughput approaches dozens of gigabases per day, there is a growing need for efficient software for analysis of transcriptome sequencing (RNA-Seq) data. Myrna is a cloud-computing pipeline for calculating differential gene expression in large RNA-Seq datasets. We apply Myrna to the analysis of publicly available data sets and assess the goodness of fit of standard statistical models. Myrna is available from http://bowtie-bio.sf.net/myrna.
We present recount3, a resource consisting of over 750,000 publicly available human and mouse RNA sequencing (RNA-seq) samples uniformly processed by our new Monorail analysis pipeline. To facilitate access to the data, we provide the recount3 and snapcount R/Bioconductor packages as well as complementary web resources. Using these tools, data can be downloaded as study-level summaries or queried for specific exon-exon junctions, genes, samples, or other features. Monorail can be used to process local and/or private data, allowing results to be directly compared to any study in recount3. Taken together, our tools help biologists maximize the utility of publicly available RNA-seq data, especially to improve their understanding of newly collected data. recount3 is available from http://rna.recount.bio.