Mash: fast genome and metagenome distance estimation using MinHash

Ondov, Brian D.; Treangen, Todd J.; Melsted, Páll; Mallonee, Adam B.; Bergman, Nicholas H.; Koren, Sergey; Phillippy, Adam M.

doi:10.1186/s13059-016-0997-x

Cited by 2,471 publications

(2,338 citation statements)

References 49 publications

Mentioning: 2,325


“…Finally, we used k-mer distances 30 , mash 28 and andi 29 to create distance matrices. andi counts the number of mismatches between equally spaced maximal exact matches between a pair of sequences.…”

Section: Results (mentioning)

confidence: 99%

Evaluation of phylogenetic reconstruction methods using bacterial whole genomes: a simulation based study

Lees, Kendall, Parkhill et al. 2018

Wellcome Open Res


Background: Phylogenetic reconstruction is a necessary first step in many analyses which use whole genome sequence data from bacterial populations. There are many available methods to infer phylogenies, and these have various advantages and disadvantages, but few unbiased comparisons of the range of approaches have been made. Methods: We simulated data from a defined “true tree” using a realistic evolutionary model. We built phylogenies from this data using a range of methods, and compared reconstructed trees to the true tree using two measures, noting the computational time needed for different phylogenetic reconstructions. We also used real data from Streptococcus pneumoniae alignments to compare individual core gene trees to a core genome tree. Results: We found that, as expected, maximum likelihood trees from good quality alignments were the most accurate, but also the most computationally intensive. Using less accurate phylogenetic reconstruction methods, we were able to obtain results of comparable accuracy; we found that approximate results can rapidly be obtained using genetic distance based methods. In real data we found that highly conserved core genes, such as those involved in translation, gave an inaccurate tree topology, whereas genes involved in recombination events gave inaccurate branch lengths. We also show a tree-of-trees, relating the results of different phylogenetic reconstructions to each other. Conclusions: We recommend three approaches, depending on requirements for accuracy and computational time. Quicker approaches that do not perform full maximum likelihood optimisation may be useful for many analyses requiring a phylogeny, as generating a high quality input alignment is likely to be the major limiting factor of accurate tree topology. We have publicly released our simulated data and code to enable further comparisons.



“…The updated MHAP version also implements bottom sketching for the second-stage filter (Ondov et al 2016). In contrast to the first-stage filter, which uses multiple hash functions (Broder et al 2000), bottom sketching uses a single hash function from which the s minimum values are retained as the sketch (Broder 1997).…”

Section: Minhash Overlapping (mentioning)

confidence: 99%
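
The bottom sketching described in the excerpt above (one hash function, keep the s smallest values) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, `hashlib.blake2b` stands in for whatever hash MHAP or Mash actually uses, and reverse-complement (canonical) k-mers are ignored. The similarity estimate at the end is the usual merged-sketch estimator and is not taken from the excerpt.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer using one fixed hash function.

    blake2b is an illustrative stand-in, not the hash used by any real tool.
    """
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k, s):
    """Return the s smallest distinct hash values over all k-mers of seq."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the s smallest values of the union of both sketches and count
    how many of them occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = bottom_sketch("ACGTACGTGGTTACGATCGA", k=4, s=8)
    b = bottom_sketch("ACGTACGTGGTTACGTTCGA", k=4, s=8)
    print(jaccard_estimate(a, b, s=8))
```

Because only one hash is computed per k-mer, the cost of sketching does not grow with the sketch size s, which is the practical difference from the multiple-hash-function scheme the excerpt contrasts it with.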

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Koren, Walenz, Berlin et al. 2017

Genome Res.

Self Cite


Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.


“…Collectively these n smallest values ("minmers") comprise a "sketch" of the input sample. By default, previous MinHash implementations for genomics data work by creating sketches from all k-mers from an input genomic dataset (though the original Mash tool does enable filtering out k-mers that appear only once using a Bloom filter (Ondov et al 2016)). While this works well for high-quality sequences such as genome assemblies (i.e., FASTA files), it quickly becomes problematic when working with raw FASTQ data where errors from NGS instruments can lead to a far larger number of unique observed k-mers than are truly present biologically.…”

Section: Results (mentioning)

confidence: 99%
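
The effect of dropping k-mers that appear only once, as mentioned in the excerpt above, can be shown with a small Python sketch. This is a simplified illustration under stated assumptions: an exact `collections.Counter` replaces the Bloom filter the excerpt attributes to Mash, `hashlib.blake2b` is a stand-in hash, and `filtered_sketch` and `min_count` are hypothetical names, not part of Mash or finch.

```python
from collections import Counter
import hashlib


def filtered_sketch(reads, k, n, min_count=2):
    """Sketch only the k-mers seen at least min_count times across reads.

    Exact counting is used for clarity; it is an assumption of this
    example and not how a Bloom-filter-based tool tracks abundance.
    """
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    kept = [km for km, c in counts.items() if c >= min_count]
    hashes = {
        int.from_bytes(
            hashlib.blake2b(km.encode("ascii"), digest_size=8).digest(), "big"
        )
        for km in kept
    }
    # The n smallest hash values form the sketch.
    return sorted(hashes)[:n]


if __name__ == "__main__":
    # The third read carries a single-base error (A -> T at position 5).
    reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
    print(len(filtered_sketch(reads, k=4, n=10, min_count=1)))
    print(len(filtered_sketch(reads, k=4, n=10, min_count=2)))
```

In this toy input the erroneous read contributes two k-mers that occur once each; with `min_count=2` they never enter the sketch, which is the behaviour the excerpt says matters for raw FASTQ data.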

“…MinHash (Broder 1997) is a document similarity estimation technique that has been applied to problems in genomics including sequence search, phylogenetic reconstruction (Ondov et al 2016;Brown and Irber 2016), and evaluating outbreaks of hospital acquired infections (HAIs) (Sim et al 2017). We developed the finch-rs library (https://github.com/onecodex/finch-rs) and finch command line tool for creating, filtering, and manipulating MinHash sketches from genomics data, including both FASTA sequence files and FASTQ raw read data from next-generation sequencing (NGS) instruments.…”

Section: Results (mentioning)

confidence: 99%