These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

Zhang, Qingpeng; Pell, Jason; Canino-Koning, Rosangela; Howe, Adina; Brown, C. Titus

doi:10.1371/journal.pone.0101271

Cited by 106 publications

(100 citation statements)

References 40 publications

(88 reference statements)

“…For example, one tool for scaling metagenome sequence assembly uses a Bloom filter populated with solid k-mers as a memory-efficient, probabilistic representation of a De Bruijn graph [19]. Other tools use counting Bloom filters [31,32] or the related CountMin sketch [33] to represent De Bruijn graphs for compression [20] or digital normalization and related tasks [34]. We expect ideas from Lighter could be useful in reducing the memory footprint of these and other tools.…”

Section: Discussion (mentioning)

confidence: 99%

Lighter: fast and memory-efficient sequencing error correction without counting

2014

Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
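The two ideas in this abstract, a Bloom filter and sampling k-mers into it, can be illustrated with a minimal Python sketch. This is not Lighter's code: the class `BloomFilter`, the helper `sample_into_filter`, the salted BLAKE2b hashing, and all parameter values are assumptions chosen for illustration. The rule that decides which k-mers enter the second ("likely correct") filter is not described in the abstract, so it is left out.

```python
import hashlib
import random


class BloomFilter:
    """Fixed-size bit array with several hash functions (no false negatives)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive each hash function by salting BLAKE2b with an index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(read, k):
    """Yield every length-k substring of a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def sample_into_filter(reads, k, fraction, bloom, rng):
    """Insert each k-mer occurrence into the filter with probability `fraction`."""
    for read in reads:
        for kmer in kmers(read, k):
            if rng.random() < fraction:
                bloom.add(kmer)


reads = ["ACGTACGTAC", "CGTACGTACG"]
sampled = BloomFilter(num_bits=1 << 16, num_hashes=3)
# fraction=1.0 keeps every k-mer; a real run would lower it as depth grows.
sample_into_filter(reads, k=5, fraction=1.0, bloom=sampled, rng=random.Random(0))
```

Because each k-mer occurrence is sampled independently, a k-mer that occurs many times is very likely to reach the filter even at a small sampling fraction, while a k-mer seen once (often an error) usually is not. Lowering the fraction as sequencing depth rises keeps the number of inserted k-mers, and therefore the required filter size, roughly stable.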

“…By inserting k-mers into a Bloom filter the first time they are observed, and adding them to the higher-overhead exact hash table only upon subsequent observations. Later, Zhang et al (2014) demonstrated that the count-min sketch (Cormode and Muthukrishnan, 2005) (a frequency estimation data structure) can be used to approximately answer k-mer presence and abundance queries when one requires only approximate counts of k-mers in the input. Such approaches can yield order-of-magnitude improvements in memory usage over competing methods.…”

Section: Introduction and Related Work (mentioning)

confidence: 99%

deBGR: an efficient and near-exact representation of the weighted de Bruijn graph

et al. 2017

Motivation: Almost all de novo short-read genome and transcriptome assemblers start by building a representation of the de Bruijn Graph of the reads they are given as input. Even when other approaches are used for subsequent assembly (e.g. when one is using ‘long read’ technologies like those offered by PacBio or Oxford Nanopore), efficient k-mer processing is still crucial for accurate assembly, and state-of-the-art long-read error-correction methods use de Bruijn Graphs. Because of the centrality of de Bruijn Graphs, researchers have proposed numerous methods for representing de Bruijn Graphs compactly. Some of these proposals sacrifice accuracy to save space. Further, none of these methods store abundance information, i.e. the number of times that each k-mer occurs, which is key in transcriptome assemblers. Results: We present a method for compactly representing the weighted de Bruijn Graph (i.e. with abundance information) with essentially no errors. Our representation yields zero errors while increasing the space requirements by less than 18–28% compared to the approximate de Bruijn graph representation in Squeakr. Our technique is based on a simple invariant that all weighted de Bruijn Graphs must satisfy, and hence is likely to be of general interest and applicable in most weighted de Bruijn Graph-based systems. Availability and implementation: https://github.com/splatlab/debgr. Supplementary information: Supplementary data are available at Bioinformatics online.
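The citation statement for this entry credits Zhang et al (2014) with using a count-min sketch for approximate k-mer abundance queries. A minimal generic count-min sketch is shown below, assuming Python and salted BLAKE2b hashing. The class `CountMinSketch`, the helper `count_sequence`, and the width and depth values are hypothetical names and settings for illustration. This is a textbook form of the data structure, not the implementation from the cited paper.

```python
import hashlib


class CountMinSketch:
    """depth rows of width counters; each row uses a different hash function."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, kmer):
        # Assumption: one salted BLAKE2b hash per row selects the counter.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, kmer):
        """Increment one counter per row for this k-mer."""
        for row, col in self._columns(kmer):
            self.rows[row][col] += 1

    def estimate(self, kmer):
        """Return the minimum counter; collisions can only inflate it."""
        return min(self.rows[row][col] for row, col in self._columns(kmer))


def count_sequence(sketch, sequence, k):
    """Add every k-mer of a sequence to the sketch as it streams past."""
    for i in range(len(sequence) - k + 1):
        sketch.add(sequence[i:i + k])


sketch = CountMinSketch(width=1024, depth=4)
count_sequence(sketch, "ACGTACGT", k=4)
abundance = sketch.estimate("ACGT")  # never below the true count of 2
```

The estimate is one-sided: hash collisions can raise a counter but never lower it, so the reported abundance is at least the true count. Memory is fixed by `width * depth` regardless of how many distinct k-mers are seen, which is the source of the memory savings described in the quoted statement.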

“…DIPA-based k-mer counting provides rapid and robust microbial community analysis and characterization without the (Jiang et al, 2012) and/or de novo assembly in order to compare and contrast sequence datasets. K-mers are critical to assembly (Li et al, 2015), counting (Zhang et al, 2014), partitioning (Howe et al, 2014), genomic binning (Wu et al, 2015) and classification (Jiang et al, 2012). K-mer based counting is amongst the fastest approaches for profiling metagenomic and/or metatranscriptomic data (Lindgreen et al, 2015).…”

Section: Introduction (mentioning)

confidence: 99%

“…There are many k-mer counters (Zhang et al, 2014), and even database dependent k-mer profilers (Koslicki and Falush, 2016). MerCat provides only k-mer counting tool for assembled contigs (.fna), translated protein-coding ORFs (.faa) and NGS reads (.fastq) for any size k-mer.…”

Section: Introduction (mentioning)

confidence: 99%

MerCat: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from metagenomic and/or metatranscriptomic sequencing data

White, Panyala, Glass et al. 2017

Preprint

Summary: MerCat ("Mer-Catenate") is a parallel, highly scalable and modular property software package for robust analysis of features in next-generation sequencing data. Using assembled contigs and raw sequence reads from any platform as input, MerCat performs k-mer counting of any length k, resulting in feature abundance counts tables. MerCat allows for direct analysis of data properties without reference sequence database dependency commonly used by search tools such as BLAST for compositional analysis of whole community shotgun sequencing (e.g., metagenomes and metatranscriptomes). Availability and implementation: MerCat is written in Python and distributed under a BSD license. The source code of MerCat is freely available at https://github.com/pnnl/mercat. MerCat is compatible with Python 2 and 3 and works on both Mac OS X and Linux. MerCat can also be easily installed using bioconda: conda install mercat
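For contrast with the probabilistic structures above, exact k-mer counting into an abundance table can be sketched in a few lines of Python. The function `count_kmers` is a hypothetical helper written for this illustration. It is not MerCat's API and does not reflect how MerCat parallelizes or parses its input files.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Return an exact table mapping each k-mer to its number of occurrences."""
    table = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]] += 1
    return table


table = count_kmers(["ACGTACGT"], k=4)
# Print the abundance table, most frequent k-mer first.
for kmer, count in table.most_common():
    print(kmer, count)
```

An exact table like this stores one entry per distinct k-mer, so its memory use grows with the diversity of the data set, which is the cost that sketch-based counters trade accuracy to avoid.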
