2014
DOI: 10.1371/journal.pone.0101271

These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

Abstract: K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-…
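
To make the abstract's central idea concrete, the following is a minimal sketch of k-mer counting with a Count-Min Sketch. It is an illustration only and not khmer's implementation: the class name `CountMinSketch`, the helper `kmers`, the table dimensions, and the use of salted BLAKE2b hashes are all assumptions chosen for readability.

```python
import hashlib


class CountMinSketch:
    """Hypothetical Count-Min Sketch: `depth` rows of `width` counters."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent hash per row, derived by salting with the row number
        # (an assumption of this sketch, not a description of khmer).
        digest = hashlib.blake2b(
            item.encode("ascii"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.tables[row][self._index(item, row)] += 1

    def count(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate; it never falls below the true count.
        return min(
            self.tables[row][self._index(item, row)]
            for row in range(self.depth)
        )


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


sketch = CountMinSketch()
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)

print(sketch.count("ACGT"))  # estimate; at least the true count of 2
```

Memory use is fixed by `width * depth` regardless of how many distinct k-mers are seen, which is the trade this kind of structure makes: bounded memory in exchange for counts that may be overestimated.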

Cited by 106 publications (100 citation statements)
References 40 publications (88 reference statements)
“…For example, one tool for scaling metagenome sequence assembly uses a Bloom filter populated with solid k-mers as a memory-efficient, probabilistic representation of a De Bruijn graph [19]. Other tools use counting Bloom filters [31,32] or the related CountMin sketch [33] to represent De Bruijn graphs for compression [20] or digital normalization and related tasks [34]. We expect ideas from Lighter could be useful in reducing the memory footprint of these and other tools.…”
Section: Discussion (mentioning)
confidence: 99%
“…By inserting k-mers into a Bloom filter the first time they are observed, and adding them to the higher-overhead exact hash table only upon subsequent observations. Later, Zhang et al (2014) demonstrated that the count-min sketch (Cormode and Muthukrishnan, 2005) (a frequency estimation data structure) can be used to approximately answer k-mer presence and abundance queries when one requires only approximate counts of k-mers in the input. Such approaches can yield order-of-magnitude improvements in memory usage over competing methods.…”
Section: Introduction and Related Work (mentioning)
confidence: 99%
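
The Bloom-filter prefilter described in the statement above can be sketched as follows. This is a simplified illustration under stated assumptions, not the cited tool's code: `BloomFilter`, `count_with_prefilter`, the bit-array size, and the salted BLAKE2b hashing are hypothetical choices, and a Bloom filter false positive can cause a k-mer seen once to be recorded with a count of 2.

```python
import hashlib


class BloomFilter:
    """Hypothetical Bloom filter over a fixed-size bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def count_with_prefilter(kmer_iter):
    """Count k-mers exactly, but only those observed more than once."""
    seen_once = BloomFilter()
    counts = {}  # exact hash table, reserved for repeated k-mers
    for kmer in kmer_iter:
        if kmer in counts:
            counts[kmer] += 1
        elif kmer in seen_once:
            # Second observation (or a false positive): promote to the table.
            counts[kmer] = 2
        else:
            # First observation: record it only in the compact Bloom filter.
            seen_once.add(kmer)
    return counts


sequence = "ACGTACGTAC"
all_kmers = [sequence[i:i + 4] for i in range(len(sequence) - 3)]
repeated = count_with_prefilter(all_kmers)
print(repeated)
```

Because singleton k-mers never reach the dictionary, the exact table stays small when most distinct k-mers occur only once.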
“…DIPA-based k-mer counting provides rapid and robust microbial community analysis and characterization without the (Jiang et al, 2012) and/or de novo assembly in order to compare and contrast sequence datasets. K-mers are critical to assembly (Li et al, 2015), counting (Zhang et al, 2014), partitioning (Howe et al, 2014), genomic binning (Wu et al, 2015) and classification (Jiang et al, 2012). K-mer based counting is amongst the fastest approaches for profiling metagenomic and/or metatranscriptomic data (Lindgreen et al, 2015).…”
Section: Introduction (mentioning)
confidence: 99%
“…There are many k-mer counters (Zhang et al, 2014), and even database dependent k-mer profilers (Koslicki and Falush, 2016). MerCat provides only k-mer counting tool for assembled contigs (.fna), translated protein-coding ORFs (.faa) and NGS reads (.fastq) for any size k-mer.…”
Section: Introduction (mentioning)
confidence: 99%