RecSplit: Minimal Perfect Hashing via Recursive Splitting

Esposito, Emmanuel; Graf, Thomas Mueller; Vigna, Sebastiano

doi:10.1137/1.9781611976007.14

Cited by 18 publications

(22 citation statements)

References 14 publications

“…In the experimental part of this work (Section 4), we show that when applied to the k-mer counting, the error of Count-Min may not be acceptable. The construction of a MPHF can be hyper-graph peeling-based [16,17] or array-based [18]. The first family of algorithms leads to smaller MPHFs, close to the theoretical space lower-bound of 1.44 bits per key, while array-based MPHFs are more cache friendly and much easier conceptually despite being less memory efficient than their mainstream counterparts.…”

Section: K-mer Spectrum

mentioning

confidence: 99%
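The figure of 1.44 bits per key quoted above can be recovered with a short counting argument. The sketch below is a standard derivation, not taken from the citing paper, and it assumes a key set S of size n drawn from a universe much larger than n. A uniformly random function from S to n slots is a bijection with probability n!/n^n, which suggests about log2(n^n/n!) bits are needed to single one out; Stirling's approximation then gives the per-key constant.

```latex
\[
\Pr\bigl[\,h : S \to \{0,\dots,n-1\}\ \text{is a bijection}\,\bigr] = \frac{n!}{n^{n}},
\qquad
\text{space} \;\gtrsim\; \log_2 \frac{n^{n}}{n!}\ \text{bits}
\]
\[
\log_2 \frac{n^{n}}{n!}
  = n \log_2 e \;-\; \tfrac{1}{2}\log_2(2\pi n) \;+\; O(1/n)
\quad\Longrightarrow\quad
\frac{1}{n}\log_2 \frac{n^{n}}{n!} \;\longrightarrow\; \log_2 e \approx 1.4427
\]
```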

Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation

Shibuya

Belazzougui

Kucherov

2020

Preprint

Motivation: In many bioinformatics pipelines, k-mer counting is often a required step, with existing methods focusing on optimizing time or memory usage. These methods usually produce very large count tables explicitly representing k-mers themselves. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) or Count-Min sketches. The former is only applicable to static maps not subject to updates, while the latter suffers from potentially very large point-query errors, making it unsuitable when counters are required to be highly accurate.

Results: We introduce Set-Min sketch, a sketching technique inspired by Count-Min sketch, for representing associative maps, more specifically, k-mer count tables. We show that Set-Min sketch provides a very low error rate, both in terms of the probability and the size of errors, much lower than a Count-Min sketch of similar dimensions. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, especially for large values of k. Space-efficiency of Set-Min takes advantage of the power-law distribution of k-mer counts in genomic datasets.
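To make the point-query error of a Count-Min sketch concrete, here is a minimal Python sketch of the classic structure. It is not the Set-Min sketch from the abstract, and the class name `CountMin`, the salted BLAKE2b row hashes, and the toy dimensions are all assumptions chosen for illustration. Each key increments one counter per row, and a query returns the minimum over the rows, so estimates can only overshoot the true count.

```python
import hashlib


class CountMin:
    """Classic Count-Min sketch: depth rows of width counters each."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # One hash function per row, derived by salting BLAKE2b with the row id
        # (an illustrative choice, not prescribed by the source).
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += count

    def query(self, key: str) -> int:
        # Colliding keys only ever add to a counter, so the minimum over rows
        # is an upper bound on the true count.
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))


# Deliberately tiny sketch: 50 distinct k-mer-like keys into 2 rows of 4 counters.
sketch = CountMin(width=4, depth=2)
keys = [f"kmer{i}" for i in range(50)]
for k in keys:
    sketch.add(k)  # true count of every key is 1

estimates = [sketch.query(k) for k in keys]
```

With far more keys than counters, most estimates exceed the true count of 1, which is the kind of overestimation the abstract refers to. A real deployment would size `width` and `depth` from a target error bound.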

“…The construction of MPHFs can be hyper-graph peeling-based [19,20] or array-based [21]. The first family of algorithms leads to smaller MPHFs, close to theoretical space lower-bound of 1.44 bits per key, while array-based MPHFs are conceptually simpler and have practical implementations for k-mer sets, such as BBHash [22].…”

Section: Minimal Perfect Hashing

mentioning

confidence: 99%

Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation

Shibuya

Belazzougui

Kucherov

2020

Preprint

“…Hence, recent efforts have been made to use minimal perfect hash functions (MPHFs) [10,18,26] for in-memory key-value lookups, which significantly reduce the space cost by avoiding storing keys. For a set of n key-value items where each item is a tuple (k_i, v_i) of key k_i and value v_i, a minimal perfect hash function H′ maps the n keys to integers 0 to n − 1 without collision.…”

mentioning

confidence: 99%

“…Step ii): For each group we find a hash function H such that H maps the four keys to integers 0 to 3 without collision. For most modern random hash function algorithms, we may generate an independent hash function H_s by using a [18] Not allowed…”

mentioning

confidence: 99%
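The step quoted above, finding a hash function that sends a group of four keys to 0..3 without collision, can be illustrated with a brute-force seed search. The quote is truncated, so the following is only a sketch under assumptions: `seeded_hash` and `find_seed` are hypothetical names, the seeded family is built from salted BLAKE2b, and trying seeds 0, 1, 2, ... in order is one plausible reading rather than the citing paper's confirmed procedure.

```python
import hashlib
from itertools import count


def seeded_hash(key: str, seed: int, m: int) -> int:
    """Hypothetical seeded hash family H_s: maps key into 0..m-1."""
    digest = hashlib.blake2b(
        key.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % m


def find_seed(keys: list[str]) -> int:
    """Return the smallest seed whose hash is collision-free on keys."""
    m = len(keys)
    if len(set(keys)) != m:
        raise ValueError("keys must be distinct, or no seed can succeed")
    for seed in count():
        # A seed works when the m keys land in m distinct slots.
        if len({seeded_hash(k, seed, m) for k in keys}) == m:
            return seed
    raise AssertionError("unreachable: count() is infinite")


group = ["alpha", "beta", "gamma", "delta"]
seed = find_seed(group)
slots = [seeded_hash(k, seed, len(group)) for k in group]
```

For four keys, a random function is a bijection with probability 4!/4^4 = 24/256, so roughly one seed in eleven is expected to succeed and the search terminates quickly. Only the winning seed needs to be stored per group, which is what keeps the per-key space small.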

Ludo Hashing

Shi

Qian

2020

Proc. ACM Meas. Anal. Comput. Syst.

Key-value lookup engines running in fast memory are crucial components of many networked and distributed systems such as packet forwarding, virtual network functions, content distribution networks, distributed storage, and cloud/edge computing. These lookup engines must be memory-efficient because fast memory is small and expensive. This work presents a new key-value lookup design, called Ludo Hashing, which costs the least space (3.76 + 1.05l bits per key-value item for l-bit values) among known compact lookup solutions including the recently proposed partial-key Cuckoo and Bloomier perfect hashing. In addition to its space efficiency, Ludo Hashing works well with most practical systems by supporting fast lookups, fast updates, and concurrent writing/reading. We implement Ludo Hashing and evaluate it with both micro-benchmark and two network systems deployed in CloudLab. The results show that in practice Ludo Hashing saves 40% to 80%+ memory cost compared to existing dynamic solutions. It costs only a few GB memory for 1 billion key-value items and achieves high lookup throughput: over 65 million queries per second on a single node with multiple threads.
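The space figure in the abstract, 3.76 + 1.05l bits per item for l-bit values, can be turned into a quick back-of-the-envelope calculation. The helper names below and the choice of 20-bit values are assumptions for illustration only; the abstract does not state which value length underlies its "few GB" figure.

```python
def ludo_bits_per_item(value_bits: int) -> float:
    """Space per key-value item in bits, using the formula from the abstract."""
    return 3.76 + 1.05 * value_bits


def ludo_total_gb(n_items: int, value_bits: int) -> float:
    """Total space in gigabytes (10^9 bytes) for n_items items."""
    return n_items * ludo_bits_per_item(value_bits) / 8 / 1e9


# Hypothetical workload: 1 billion items with 20-bit values.
bits = ludo_bits_per_item(20)        # 3.76 + 21.0 = 24.76 bits per item
total = ludo_total_gb(10**9, 20)     # about 3.1 GB
```

Under this assumed value length the total comes to roughly 3.1 GB, which is consistent with the abstract's statement that 1 billion items cost only a few GB.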
