2021
DOI: 10.1101/2021.03.18.436035
Preprint

CONSULT: Accurate contamination removal using locality-sensitive hashing

Abstract: A fundamental question appears in many bioinformatics applications: Does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) projects either assemble the organelle genome or compute genomic distances directly from unassembled reads. Using unassembled reads needs contamination detection because samples often include reads from unintended gr…

Citations: cited by 2 publications (9 citation statements)
References: 111 publications (123 reference statements)
“…We also constructed the CLARK database using the standard parameters, e.g., k=31, default classification mode, species rank for classification. Note that following Rachtman et al (2021), 100 archaeal genomes were left out from the reference and used as part of the query set.…”
Section: Methods (mentioning)
confidence: 99%
“…We created two sets of queries: bacterial and archaeal. The 100 archaeal queries were chosen by Rachtman et al (2021) and were excluded from the reference set. For the bacterial set, we selected a set of 120 bacterial genomes among genomes added to RefSeq after ToL was constructed.…”
Section: Methods (mentioning)
confidence: 99%
“…In extracting k-mers from all input genomes, we use minimizers to perform an initial subsampling by choosing the k-mer whose encoding has a MurmurHash3 (Appleby, 2009) value that is the smallest in a local window of size w = k + 3. We adopt the left/right encoding of k-mers introduced by Rachtman, Bafna, et al (2021); this encoding allows fast calculation of Hamming distance using just four instructions (a pop-count, an XOR, an OR, and a shift). For k > 16, these encodings require 64 bits.…”
Section: Extract k-mers (mentioning)
confidence: 99%
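
The last statement above describes a left/right k-mer encoding whose Hamming distance needs only an XOR, a shift, an OR, and a pop-count. A minimal Python sketch of that idea follows. The nucleotide-to-bit mapping (A=00, C=01, G=10, T=11), the placement of the left bits in the upper half of the word, and the function names are assumptions made for illustration; they are not taken from the quoted text. Python integers are unbounded, so the sketch also applies an explicit mask, which a fixed-width implementation would likely get for free from a truncating cast.

```python
# Hypothetical 2-bit codes; the mapping is an assumption for this sketch.
_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_left_right(kmer: str) -> int:
    """Pack a k-mer into 2k bits: all 'left' bits first, then all 'right' bits."""
    k = len(kmer)
    left = 0
    right = 0
    for base in kmer:
        code = _CODE[base]
        left = (left << 1) | (code >> 1)    # high bit of the 2-bit code
        right = (right << 1) | (code & 1)   # low bit of the 2-bit code
    return (left << k) | right


def hamming(x: int, y: int, k: int) -> int:
    """Count positions where two encoded k-mers of length k differ."""
    z = x ^ y                                # XOR: bits that differ
    folded = z | (z >> k)                    # shift + OR: fold left half onto right half
    mismatches = folded & ((1 << k) - 1)     # keep one bit per position
    return bin(mismatches).count("1")        # pop-count


if __name__ == "__main__":
    a = encode_left_right("AAAA")
    b = encode_left_right("CGTA")
    print(hamming(a, b, 4))
```

A position counts as a mismatch when either of its two bits differs, which is why the two halves are OR-ed together before counting. With k = 32 the packed value occupies 64 bits, consistent with the statement that encodings need 64 bits for k > 16.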