2016
DOI: 10.1186/s13059-016-0997-x

Mash: fast genome and metagenome distance estimation using MinHash

Abstract: Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassemble…

Citations: cited by 2,471 publications (2,338 citation statements)
References: 49 publications
“…Finally, we used k-mer distances [30], mash [28] and andi [29] to create distance matrices. andi counts the number of mismatches between equally spaced maximal exact matches between a pair of sequences.…”
Section: Results (mentioning)
confidence: 99%
“…The updated MHAP version also implements bottom sketching for the second-stage filter (Ondov et al 2016). In contrast to the first-stage filter, which uses multiple hash functions (Broder et al 2000), bottom sketching uses a single hash function from which the s minimum values are retained as the sketch (Broder 1997).…”
Section: Minhash Overlapping (mentioning)
confidence: 99%
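
The bottom sketching described in the statement above can be illustrated with a minimal Python sketch. It is not the code of MHAP or Mash: the BLAKE2b-based `hash64` is a stand-in hash chosen for convenience, the function names are hypothetical, and canonical (strand-independent) k-mers are ignored for brevity. A single hash function is applied to every k-mer, the s smallest distinct values are kept, and the Jaccard index is estimated from the s smallest values of the merged sketches.

```python
import hashlib


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k, s):
    """Return the s smallest distinct hash values over all k-mers, sorted."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches.

    The s smallest values of the union form a sample of the combined
    k-mer set; the fraction present in both sketches is the estimate.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Two identical sequences yield identical sketches and an estimate of 1.0, while sequences with no k-mers in common yield 0.0. Larger values of s reduce the variance of the estimate at the cost of a larger sketch.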
“…Collectively these n smallest values ("minmers") comprise a "sketch" of the input sample. By default, previous MinHash implementations for genomics data work by creating sketches from all k-mers from an input genomic dataset (though the original Mash tool does enable filtering out k-mers that appear only once using a Bloom filter (Ondov et al 2016)). While this works well for high-quality sequences such as genome assemblies (i.e., FASTA files), it quickly becomes problematic when working with raw FASTQ data where errors from NGS instruments can lead to a far larger number of unique observed k-mers than are truly present biologically.…”
Section: Results (mentioning)
confidence: 99%
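
The effect of discarding k-mers seen only once, as mentioned in the statement above, can be sketched in Python under stated assumptions. This illustration uses an exact `Counter` rather than a Bloom filter, a stand-in BLAKE2b hash, and hypothetical names (`filtered_sketch`, `min_count`); it does not reproduce the Mash or finch implementations.

```python
import hashlib
from collections import Counter


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def filtered_sketch(reads, k, s, min_count=2):
    """Sketch only k-mers observed at least min_count times across reads.

    Exact counting stands in for the Bloom filter here; it uses more
    memory but makes the filtering rule easy to follow.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    kept = {hash64(km) for km, c in counts.items() if c >= min_count}
    return sorted(kept)[:s]


# Two error-free copies of a read plus one copy with a single substitution.
reads = ["ACGTACGT", "ACGTACGT", "ACGTTCGT"]
unfiltered = filtered_sketch(reads, k=4, s=100, min_count=1)
filtered = filtered_sketch(reads, k=4, s=100)
```

In this toy input the single substitution introduces four k-mers that occur only once, so filtering removes them from the sketch while the k-mers supported by both error-free reads remain.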
“…MinHash (Broder 1997) is a document similarity estimation technique that has been applied to problems in genomics including sequence search, phylogenetic reconstruction (Ondov et al 2016; Brown and Irber 2016), and evaluating outbreaks of hospital acquired infections (HAIs) (Sim et al 2017). We developed the finch-rs library (https://github.com/onecodex/finch-rs) and finch command line tool for creating, filtering, and manipulating MinHash sketches from genomics data, including both FASTA sequence files and FASTQ raw read data from next-generation sequencing (NGS) instruments.…”
Section: Results (mentioning)
confidence: 99%