CONSULT: Accurate contamination removal using locality-sensitive hashing

Rachtman, Eleonora; Bafna, Vineet; Mirarab, Siavash

doi:10.1101/2021.03.18.436035

Cited by 2 publications (9 citation statements)

References 111 publications (123 reference statements)

“…We also constructed the CLARK database using the standard parameters, e.g., k=31, default classification mode, species rank for classification. Note that following Rachtman et al (2021), 100 archaeal genomes were left out from the reference and used as part of the query set.…”

Section: Methods (mentioning)

confidence: 99%

“…We created two sets of queries: bacterial and archaeal. The 100 archaeal queries were chosen by Rachtman et al (2021) and were excluded from the reference set. For the bacterial set, we selected a set of 120 bacterial genomes among genomes added to RefSeq after ToL was constructed.…”

Section: Methods (mentioning)

confidence: 99%

“…Query genomes span 29 phyla, and most queries are from distinct genera (102 genera across 120 queries); only two query genomes belong to the same species. The 100 archaeal queries were chosen by Rachtman et al (2021) from WoL set using a similar approach and were excluded from the reference set. We generated 150bp synthetic reads using ART (Huang et al, 2012) at higher coverage, and then subsampled down to 66667 reads for each query (i.e., 10Mbp per sample).…”

Section: Experiments 1: Controlled Novelty (mentioning)

confidence: 99%
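The subsampling figure in the statement above can be checked with one line of arithmetic. This is only a sanity check of the quoted numbers (66667 reads of 150bp each); the variable names are illustrative.

```python
# Quoted values: 66667 reads per query, 150 bp per read.
reads_per_query = 66667
read_length_bp = 150

# Total sequenced bases per query sample.
total_bp = reads_per_query * read_length_bp
print(total_bp)  # 10000050, i.e. about 10 Mbp
```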

“…Kraken-II achieves this by masking some positions in a k-mer (default: 7 out of 31). Recently, Rachtman et al (2021) showed that novel reads (e.g., those with 10-15% distance to the closest match) can be identified with higher accuracy by making inexact matches a central feature of the search. The resulting method, CONSULT, uses locality-sensitive hashing (LSH) to partition k-mers in the reference set into fixed-size buckets such that for a given k-mer, the reference k-mers with distance up to a certain threshold are within pre-determined buckets with high probability.…”

Section: Introduction (mentioning)

confidence: 99%
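The bucketing idea described in the statement above can be illustrated with a small bit-sampling LSH for Hamming distance. This is a minimal sketch under stated assumptions, not CONSULT's implementation: the function names, the parameter values (k=32, h=15 sampled positions, l=2 tables), and the use of unbounded Python lists instead of fixed-size buckets are all choices made here for illustration.

```python
import math
import random

def make_masks(k, h, l, seed=0):
    """Pick l random sets of h positions out of k, one set per hash table."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(k), h)) for _ in range(l)]

def lsh_keys(kmer, masks):
    """Project the k-mer onto each position set, giving one key per table."""
    return ["".join(kmer[i] for i in positions) for positions in masks]

def build_index(ref_kmers, masks):
    """Store each reference k-mer in one bucket per table.
    Assumption: buckets are plain lists here, not fixed-size arrays."""
    tables = [dict() for _ in masks]
    for kmer in ref_kmers:
        for table, key in zip(tables, lsh_keys(kmer, masks)):
            table.setdefault(key, []).append(kmer)
    return tables

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def query(kmer, tables, masks, max_dist):
    """Inspect only the query's own buckets and keep candidates
    whose Hamming distance is at most max_dist."""
    hits = set()
    for table, key in zip(tables, lsh_keys(kmer, masks)):
        for candidate in table.get(key, []):
            if hamming(kmer, candidate) <= max_dist:
                hits.add(candidate)
    return hits

def collision_probability(k, h, l, d):
    """Chance that two k-mers at Hamming distance d share a bucket in at
    least one of l tables, when each table samples h of k positions
    uniformly without replacement and tables are independent."""
    p_one = math.comb(k - d, h) / math.comb(k, h)
    return 1 - (1 - p_one) ** l

if __name__ == "__main__":
    masks = make_masks(k=32, h=15, l=2)
    reference = "ACGT" * 8
    tables = build_index([reference], masks)
    print(query(reference, tables, masks, max_dist=3))
    for d in (1, 3, 5, 10):
        print(d, round(collision_probability(32, 15, 2, d), 3))
```

The collision probability falls as the distance grows, which is the property that lets a method of this kind favor near matches while examining only a few buckets per query.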

CONSULT-II: Accurate taxonomic identification and profiling using locality-sensitive hashing

Şapcı, Rachtman, Mirarab. 2023. Preprint. Self Cite

Taxonomic classification of metagenomic reads is a well-studied yet challenging problem. Identifying species that belong to ranks without close representation in a reference dataset is particularly challenging. While k-mer-based methods have performed well in terms of running time and accuracy, they have reduced accuracy for novel species. Here, we show that using locality-sensitive hashing (LSH) can increase the sensitivity of the k-mer-based search. Our method, which combines LSH with several heuristic techniques, including soft LCA labeling and voting, is more accurate than alternatives in both taxonomic classification and profiling.

“…In extracting k-mers from all input genomes, we use minimizers to perform an initial subsampling by choosing the k-mer whose encoding has a MurmurHash3 (Appleby, 2009) value that is the smallest in a local window of size w = k + 3. We adopt the left/right encoding of k-mers introduced by Rachtman, Bafna, et al (2021); this encoding allows fast calculation of Hamming distance using just four instructions (a pop-count, an XOR, an OR, and a shift). For k > 16, these encodings require 64 bits.…”

Section: Extract k-mers (mentioning)

confidence: 99%
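The Hamming-distance trick mentioned in the statement above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the 2-bit nucleotide codes, the choice of which half holds which bit, and the names `encode` and `hamming` are made up here. The sketch assumes the first bit of every nucleotide code sits in the upper 32 bits and the second bit in the lower 32 bits of a 64-bit word.

```python
# Assumed 2-bit codes; any bijection onto {0,1,2,3} works for this sketch.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
HALF = 32                    # bits per half of the 64-bit word
LOW_MASK = (1 << HALF) - 1   # selects the lower (right) half

def encode(kmer):
    """Pack a k-mer (k <= 32) into a 64-bit integer: first bits of all
    codes in the left half, second bits in the right half."""
    if len(kmer) > HALF:
        raise ValueError("k must be at most 32 for a 64-bit encoding")
    left = right = 0
    for base in kmer:
        code = CODE[base]
        left = (left << 1) | (code >> 1)
        right = (right << 1) | (code & 1)
    return (left << HALF) | right

def hamming(x, y):
    """Count positions where two encoded k-mers differ."""
    z = x ^ y        # XOR: a set bit marks a differing code bit
    z |= z >> HALF   # shift + OR: fold the left half onto the right half
    # Pop-count of the lower half: one set bit per mismatching position.
    # Python integers are unbounded, so the mask is explicit here; with
    # fixed-width integers a truncation to 32 bits plays the same role.
    return bin(z & LOW_MASK).count("1")

if __name__ == "__main__":
    a = encode("ACGTACGT")
    b = encode("ACGAACGT")
    print(hamming(a, b))  # 1
```

Folding the two halves together matters because a substitution such as A to T flips both bits of a code; the OR ensures that position is still counted once.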

Memory-bound k-mer selection for large and evolutionary diverse reference libraries

Şapcı, Mirarab. 2024. Preprint. Self Cite

Using long k-mers to find sequence matches is increasingly used in many bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which, luckily, are rapidly growing. While the increased density provides hope for dramatic improvements in accuracy, scalability is a concern. The k-mers are kept in the memory during the query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling k-mers have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we specifically ask the question: Given limited memory, what is the best strategy to select a subset of k-mers from an ultra-large dataset to include in a library such that the classification of reads suffers the least? We explore strategies to achieve this goal and show a set of experiments demonstrating the limitations of existing approaches, especially for novel and poorly sampled groups. We propose a library construction algorithm called KRANK (K-mer RANKer) that combines several components, including a hierarchical selection strategy with adaptive size restrictions and an equitable coverage strategy. We implement KRANK in highly optimized code and combine it with the locality-sensitive-hashing classifier CONSULT-II. Our method is able to reduce the memory consumption from roughly 140Gb down to 6, 12, or 24Gb, with only a 3.8%, 2.5%, or 0.5% loss in the F1 score. We show in extensive analyses that KRANK outperforms alternatives in both taxonomic classification and taxonomic profiling, using reasonable memory sizes.

Code availability: The implementation is available at https://github.com/bo1929/KRANK.
