2019
DOI: 10.12688/f1000research.19675.1

Large-scale sequence comparisons with sourmash

Abstract: The sourmash software package uses MinHash-based sketching to create “signatures”, compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under t…
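To make the idea of MinHash-based sketching in the abstract concrete, the following is a minimal sketch in plain Python. It is not sourmash's implementation or API: the function names (`kmers`, `hash_kmer`, `sketch`, `estimate_jaccard`), the SHA-1 based hash, and the default values `k=5` and `num=50` are all assumptions chosen for illustration. It builds a bottom-k sketch (the smallest k-mer hashes of a sequence) and estimates Jaccard similarity from two such sketches.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.sha1(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(seq, k=5, num=50):
    """Bottom-k sketch: the `num` smallest distinct k-mer hashes, sorted."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:num]


def estimate_jaccard(a, b, num=50):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    # Take the `num` smallest hashes of the union of both sketches ...
    merged = sorted(set(a) | set(b))[:num]
    if not merged:
        return 0.0
    # ... and count how many of them occur in both sketches.
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Under these assumptions, two identical sequences give an estimate of 1.0 and two sequences with no k-mers in common give 0.0. Only the small sketches need to be kept and compared, which is what makes the comparison cheap in memory.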

Citations: cited by 167 publications (166 citation statements)
References: 21 publications
“…The UKMCC1015 strain has two plasmids: UKMCC1015_2 (7k bp) and UKMCC1015_3 (2k bp). We performed sequence comparison using sourmash (Pierce et al. 2019) against the PLSDB database (Galata et al.…”
Section: Results (mentioning)
confidence: 99%
“…Automated genome assembly was performed with the tool Automatic Assembly For The Fungi (AAFTF) which performs read trimming and filtering against PhiX and other contaminants using BBMap v38.16 followed by genome assembly with SPAdes v3.13.1 (Bankevich et al 2012; Bushnell 2014; Stajich 2018). Assemblies were further cleaned of vector sequences, screened for contaminant bacteria with sourmash using database Genbank Microbes 2018.03.29 (Brown and Irber 2016; Pierce et al 2019). Duplicated small contigs were removed using minimap2 v2.17 alignment of contigs smaller than the assembly N50 (Li 2018).…”
Section: Low-coverage Genome (Lcg) Sequence Analysis (mentioning)
confidence: 99%
“…The minhash techniques have been proven to be efficient for estimating similarity in many applications that involve large datasets [11]. They have also been used for many bioinformatics applications especially for analyzing large-scale sequencing datasets [14,18]. Our algorithm makes use of LSH-based bucketing from minhash signatures to cluster the isoform sequences.…”
Section: Results (mentioning)
confidence: 99%
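
The last citation statement mentions LSH-based bucketing from minhash signatures. The quoted excerpt does not show that paper's algorithm, so the following is only a sketch of the standard banding technique, assuming fixed-length signatures in which position i holds the minimum under the i-th hash function. The function name `lsh_buckets`, the parameters `bands` and `rows`, and the toy signatures are hypothetical.

```python
from collections import defaultdict


def lsh_buckets(signatures, bands, rows):
    """Group item ids whose signatures agree on at least one full band.

    signatures: dict mapping item id -> list of bands * rows integers.
    Returns the buckets (sets of ids) that hold more than one item.
    """
    buckets = defaultdict(set)
    for item_id, sig in signatures.items():
        for b in range(bands):
            # Slice out one band and use (band index, band values) as the key,
            # so equal values in different bands do not collide.
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(item_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


# Toy signatures (made up for illustration): "a" and "b" share their first band.
toy = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
candidates = lsh_buckets(toy, bands=2, rows=2)
```

Items that land in the same bucket become candidate pairs for a closer comparison or for clustering. More bands with fewer rows each make the bucketing more permissive, while fewer bands with more rows each make it stricter.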