2019
DOI: 10.1101/687285
Preprint

Large-scale sequence comparisons with sourmash

Abstract: The sourmash software package uses MinHash-based sketching to create "signatures", compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under t…

Cited by 29 publications
(48 citation statements)
References 19 publications
“…Genomes corresponding to novel species were identified as those with identity <95% or coverage <80% when compared with known genomes (BLAST with nt) and three recent catalogs that include environmental and human microbiome assembled genomes [31][32][33] (with Mash [68]). The genomes were hierarchically clustered (single linkage with Mash distance [68]) to identify species-level clusters at 95% identity, and genus-level taxonomic classification was obtained using sourmash [69]. Similarly, novel circular plasmids were identified by comparing to the PLSDB [36] database with Mash distance and identifying clusters at 99% identity (single linkage) with no known sequence.…”
Section: © the Author(s) 2020
Citation type: mentioning
confidence: 99%
“…Detection and classification of 16S rRNA gene fragments were performed with SortMeRNA (50) and the RDP classifier (51). K-mer profiling was performed with sourmash (52,53). All statistical analyses were done in R using the vegan To profile the viral communities in these 16 samples and to compare two common techniques for viral community analyses, paired total metagenomes and viral size-fraction metagenomes (viromes) were generated from each sample (with the exception of one April sample, PN-L, that did not yield enough DNA to perform virome sequencing).…”
Section: Read Processing and Data Analysis
Citation type: mentioning
confidence: 99%
“…Kmer-db reports even faster compute times than Mash, largely through an improved k-mer hash and parallel implementation (Deorowicz et al 2019). sourmash adds some functionality to Mash and implements scaled, rather than thresholded, numbers of retained kmer hashes (Pierce et al 2019). This method can enable comparisons of datasets that differ greatly in size, but slows down analysis (Pierce et al 2019).…”
Section: Comparison With Other Alignment-free Methods
Citation type: mentioning
confidence: 99%
“…sourmash adds some functionality to Mash and implements scaled, rather than thresholded, numbers of retained kmer hashes (Pierce et al 2019). This method can enable comparisons of datasets that differ greatly in size, but slows down analysis (Pierce et al 2019). We checked the performance of these other tools against Mash for the simulated data in this study, but neither sourmash nor kmer-db showed advantage over Mash.…”
Section: Comparison With Other Alignment-free Methods
Citation type: mentioning
confidence: 99%
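
The last two statements contrast keeping a fixed number of k-mer hashes (as in Mash) with keeping a scaled fraction of them (as in sourmash). The Python sketch below is one way to illustrate that difference; it is not sourmash's or Mash's code. The function names, the BLAKE2b stand-in hash, the parameter values, and the omission of reverse-complement handling are all assumptions made for illustration.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space used below


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (no reverse-complement handling)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer. BLAKE2b is a stand-in hash (assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k, num):
    """Fixed-size sketch: keep the `num` smallest distinct hashes."""
    hashes = sorted({hash_kmer(x) for x in kmers(seq, k)})
    return set(hashes[:num])


def scaled_sketch(seq, k, scaled):
    """Scaled sketch: keep every hash below MAX_HASH / scaled, so the
    sketch grows in proportion to the number of distinct k-mers."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def containment(a, b):
    """Fraction of the hashes in sketch a that also occur in sketch b."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical data: a random "genome" and a fragment taken from it.
rng = random.Random(0)
full = "".join(rng.choices("ACGT", k=2000))
sub = full[500:1500]

# Fixed-size sketches have the same size regardless of input length.
print(len(bottom_sketch(sub, 21, 10)), len(bottom_sketch(full, 21, 10)))

# Scaled sketches differ in size, and the fragment's sketch is a subset
# of the full sequence's sketch, so containment can be estimated directly.
small, large = scaled_sketch(sub, 21, 4), scaled_sketch(full, 21, 4)
print(len(small), len(large), containment(small, large))
```

Because a scaled sketch keeps every hash under one shared threshold, the sketch of a subsequence is always a subset of the sketch of the sequence that contains it. This property is what allows data sets of very different sizes to be compared, at the cost of larger sketches for larger inputs.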