KrakenUniq: confident and fast metagenomics classification using unique k-mer counts

Breitwieser, Florian P.; Baker, David; Salzberg, Steven L.

doi:10.1186/s13059-018-1568-0

Cited by 348 publications

(355 citation statements)

References 30 publications

Supporting

Mentioning

351

Contrasting

Unclassified

“…Ten samples from each of MSBB_WES (syn7541077), MSBB_RNA (syn8612191) and MAYO_TCX (syn8612203) datasets with the greatest number of reported HHV6A (n=5/dataset) and HHV7 reads (n=5/dataset) were selected for further analysis (n=30 total). Raw reads were preprocessed with fastp, then taxonomically categorized using KrakenUniq, a fast yet highly sensitive method based on k-mers (Breitwieser et al, 2018; Chen et al, 2018). KrakenUniq identified a total of 13 HHV6A reads in 2/15 top HHV6A samples (Readhead total: 75 reads), and failed to identify any HHV7 reads in the top HHV7 subset (Readhead total: 93 reads in 15 samples).…”

Section: Methods and Results

Mentioning

confidence: 99%

Reanalysis of Alzheimer’s Brain Sequencing Data Reveals Absence of Purported HHV6A and HHV7

Chorlton

2019

Preprint

Readhead et al. recently reported in Neuron the detection and association of Human Herpesviruses 6A (HHV6A) and 7 (HHV7) with Alzheimer's disease by shotgun sequencing. I was skeptical of the specificity of their modified Viromescan bioinformatics method and subsequent analysis for numerous reasons. In their supplementary data, they report the detection of Variola virus, the etiological agent of the eradicated disease smallpox, in 97.5% of their Mount Sinai Brain Bank dataset. I reanalyzed Readhead et al's data using highly sensitive and specific alternative methods and find no HHV7 reads in their samples; HHV6A reads were found in only 2 out of their top 15 samples sorted by reported HHV6A abundance. Finally, I recreate Readhead et al's modified Viromescan method and identify reasons for its low specificity.

“…We compared ganon against kraken (Wood and Salzberg, 2014), one of the most used k-mer based methods for metagenomics short read classification and its newer version, kraken2 (Wood et al, 2019). We also included krakenuniq (Breitwieser et al, 2018), which uses the basic kraken algorithm and also allows classification on more specific levels after taxonomic assignments (e.g. up to assembly or sequence level).…”

Section: Results

Mentioning

confidence: 99%

ganon: precise metagenomics classification against large and up-to-date sets of reference sequences

Piro

Dadi

Seiler

et al. 2018

Preprint

The exponential growth of assembled genome sequences greatly benefits metagenomics studies, providing a broader catalog of reference organisms on a variety of environments. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq is no longer possible on standard infrastructures and it can take days and hundreds of GB of memory on large servers. Few methods address these issues thus far, and even though many can theoretically handle large amounts of references, time/memory requirements are prohibitive in practice. As a result, many studies that require sequence classification use the available tools in conjunction with often outdated and almost never truly up-to-date indices. This also means that the taxonomic composition of the reference database is not being adjusted based on the study performed. These factors can lead to unnecessary performance problems in the sequence classification. Motivated by those limitations we created ganon, a k-mer based read classification tool that uses Interleaved Bloom Filters in conjunction with a taxonomic clustering and a k-mer counting/filtering scheme. Ganon provides an efficient method for indexing references, keeping them updated. It requires less than 55 minutes to index the complete RefSeq of bacteria, archaea, fungi and viruses. The tool can further keep these indices up-to-date in a fraction of the time necessary to create them, allowing researchers to always work with the most recent references. Ganon makes it possible to query against very large reference sets and therefore it classifies significantly more reads and identifies more species than similar methods. When classifying a high-complexity real dataset from the CAMI challenge against complete genomes from RefSeq, ganon shows strongly increased precision while exhibiting equal or better sensitivity compared with state-of-the-art tools. When classifying the same dataset against the complete RefSeq, ganon improved the F1-Score by 65% at the genus level. Ganon supports taxonomy- and assembly-level classification as well as multiple indices and hierarchical classification. The software is open-source and available at: https://gitlab.com/rki_bioinformatics/ganon

“…Data visualization. We created the Sankey plots using the krakenuniq-report tool from KrakenUniq [30] to create a Kraken-style report from our predicted contaminations. The visualization was done using Pavian [31] extracted as SVG and colored by Inkscape.…”

Section: Methods

Mentioning

confidence: 99%

Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank

Steinegger

Salzberg

2020

Preprint

Metagenomic sequencing allows researchers to investigate organisms sampled from their native environments by sequencing their DNA directly, and then quantifying the abundance and taxonomic composition of the organisms thus captured. However, these types of analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here we describe Conterminator, an efficient method to detect and remove incorrectly labelled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination in 114,035 sequences and 2767 species in the NCBI Reference Sequence Database (RefSeq), 2,161,746 sequences and 6795 species in the GenBank database, and 14,132 protein sequences in the NR non-redundant protein database. Conterminator uncovers contamination in sequences spanning the whole range from draft genomes to “complete” model organism genomes. Our method, which scales linearly with input size, was able to process 3.3 terabytes of genomic sequence data in 12 days on a single 32-core compute node. We believe that Conterminator can become an important tool to ensure the quality of reference databases with particular importance for downstream metagenomic analyses. Source code (GPLv3): https://github.com/martin-steinegger/conterminator
