Fast and powerful hashing using tabulation

Thorup, Mikkel

doi:10.1145/3068772

Cited by 9 publications

(13 citation statements)

References 47 publications


“…A separate hash function is used for each row of the CMS with a range equal to the range of columns. In this work, we use tabulation hashing [18], which has been recently analyzed by Patrascu and Thorup et al. [13,16] and shown to provide strong statistical guarantees despite its simplicity. Furthermore, it is even as fast as the classic multiply-mod-prime scheme, i.e., (ax + b) mod p.…”

Section: Notation and Background (mentioning)

confidence: 99%
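The row-wise hashing described in the statement above can be illustrated with a minimal Count-Min Sketch in Python. This is only a sketch under stated assumptions, not the citing authors' implementation: the class name `CountMinSketch`, the parameters `depth`, `width`, and `seed`, and the choice of the Mersenne prime 2^61 - 1 are all hypothetical, and each row uses the classic multiply-mod-prime hash (ax + b) mod p reduced to the column range.

```python
import random


class CountMinSketch:
    """Count-Min Sketch with one multiply-mod-prime hash function per row."""

    PRIME = (1 << 61) - 1  # assumed prime p, larger than any 32-bit key

    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One random (a, b) pair per row; a is non-zero.
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(depth)
        ]
        self.rows = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        # (a*x + b) mod p, then reduced to the range of columns.
        a, b = self.params[row]
        return ((a * key + b) % self.PRIME) % self.width

    def add(self, key, count=1):
        # Update exactly one counter in every row.
        for r in range(len(self.rows)):
            self.rows[r][self._column(r, key)] += count

    def estimate(self, key):
        # The minimum over the rows never underestimates the true frequency.
        return min(
            self.rows[r][self._column(r, key)] for r in range(len(self.rows))
        )
```

Calling `add(key)` for every stream item and `estimate(key)` afterwards yields a frequency estimate that can only err upward, because collisions add to counters but never subtract from them.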

“…Assuming each element is represented in 32 bits (the hash function can also be used to hash 64-bit stream items [16]) and the desired output is also 32 bits, tabulation hashing works as follows: first a 4×256 table is generated and filled with random 32-bit values. Given a 32-bit input x, each character, i.e., 8-bit value, of x is used as an index for the corresponding row.…”

Section: Notation and Background (mentioning)

confidence: 99%
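The table lookup described in the statement above can be sketched in a few lines of Python. The quoted text is cut off before the combination step; the version here assumes simple tabulation, in which the looked-up values are combined with XOR. The function names `make_tables` and `tabulation_hash` and the `seed` parameter are hypothetical.

```python
import random


def make_tables(seed=0, chars=4, bits=32):
    """Build a chars x 256 table filled with random bits-bit values."""
    rng = random.Random(seed)
    return [[rng.getrandbits(bits) for _ in range(256)] for _ in range(chars)]


def tabulation_hash(key, tables):
    """Hash an integer key by XORing one table entry per 8-bit character."""
    h = 0
    for i, row in enumerate(tables):
        char = (key >> (8 * i)) & 0xFF  # i-th 8-bit character of the key
        h ^= row[char]                  # index the corresponding row
    return h
```

For 64-bit keys, the same function works with `make_tables(chars=8)`, at the cost of twice as many lookups per key.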


One Table to Count Them All: Parallel Frequency Estimation on Single-Board Computers

Taşyaran, Yıldırır, Taş, et al. 2019

Lecture Notes in Computer Science


Sketches are probabilistic data structures that can provide approximate results within mathematically proven error bounds while using orders of magnitude less memory than traditional approaches. They are tailored for streaming data analysis on architectures even with limited memory such as single-board computers that are widely exploited for IoT and edge computing. Since these devices offer multiple cores, with efficient parallel sketching schemes, they are able to manage high volumes of data streams. However, since their caches are relatively small, a careful parallelization is required. In this work, we focus on the frequency estimation problem and evaluate the performance of a high-end server, a 4-core Raspberry Pi and an 8-core Odroid. As a sketch, we employed the widely used Count-Min Sketch. To hash the stream in parallel and in a cache-friendly way, we applied a novel tabulation approach and rearranged the auxiliary tables into a single one. To parallelize the process with performance, we modified the workflow and applied a form of buffering between hash computations and sketch updates. Today, many single-board computers have heterogeneous processors in which slow and fast cores are equipped together. To utilize all these cores to their full potential, we proposed a dynamic load-balancing mechanism which significantly increased the performance of frequency estimation.


“…We sample a MinHash function h by sampling a random Zobrist hash function g and let h(x) = argmin_{j∈x} g(j). Zobrist hashing (also known as simple tabulation hashing) has been shown theoretically to have strong MinHash properties and is very fast in practice [27], [28]. We set t = 128 in our experiments, see discussion later.…”

Section: A Chosen Path Similarity Join (mentioning)

confidence: 99%
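The construction h(x) = argmin_{j∈x} g(j) from the statement above can be sketched as follows. This is an illustration under assumptions, not the citing authors' code: tokens are taken to be non-negative 32-bit integers, ties are broken by token value, and the helper names (`make_zobrist`, `zobrist`, `minhash`, `make_functions`, `estimate_jaccard`) are hypothetical. The Jaccard estimator at the end is the standard use of MinHash and is included only to show the functions working together.

```python
import random


def make_zobrist(rng, chars=4, bits=32):
    """Sample a Zobrist (simple tabulation) hash function g as lookup tables."""
    return [[rng.getrandbits(bits) for _ in range(256)] for _ in range(chars)]


def zobrist(key, tables):
    """Evaluate g(key) by XORing one table entry per 8-bit character."""
    h = 0
    for i, row in enumerate(tables):
        h ^= row[(key >> (8 * i)) & 0xFF]
    return h


def minhash(tokens, tables):
    """Return argmin over j in tokens of g(j); ties broken by token value."""
    return min(tokens, key=lambda j: (zobrist(j, tables), j))


def make_functions(t=128, seed=0):
    """Sample t independent MinHash functions (t = 128 as in the quoted text)."""
    rng = random.Random(seed)
    return [make_zobrist(rng) for _ in range(t)]


def estimate_jaccard(a, b, functions):
    """Fraction of MinHash functions on which the two sets agree."""
    matches = sum(minhash(a, g) == minhash(b, g) for g in functions)
    return matches / len(functions)
```

Because each MinHash value is an element of the set itself, two disjoint sets never agree, and two identical sets always do.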

Scalable and Robust Set Similarity Join

Christiani, Pagh, Sivertsen 2018

2018 IEEE 34th International Conference on Data Engineering (ICDE)


Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important (indeed, where the exact set similarity join is itself only an approximation of the desired result set). We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The present state-of-the-art exact methods are based on prefix-filtering, the performance of which depends on the data set having many rare tokens. Our method is robust against the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than state-of-the-art exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in high-dimensional sketching and indexing that we believe to be of wider relevance to the data engineering community.


“…Linear Probing or Progressive Overflow is a classic implementation of the hash table. It uses the hash function h to map a set of n keys into an array of size m [13]. In the PO method, if the key collides, the key value will be placed at the next index that is still empty.…”

Section: Progressive Overflow (mentioning)

confidence: 99%
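The collision rule in the statement above can be illustrated with a small linear probing table. This is a minimal sketch under assumptions, not the method as implemented in the cited paper: the class name `LinearProbingTable`, the use of Python's built-in `hash` reduced modulo m as the hash function h, and the wrap-around from the last index to the first are all choices made here for illustration.

```python
class LinearProbingTable:
    """Hash table of fixed size m that resolves collisions by linear probing."""

    def __init__(self, m):
        self.slots = [None] * m  # each slot is None or a (key, value) pair

    def _home(self, key):
        # Assumed hash function h: built-in hash reduced to the array size.
        return hash(key) % len(self.slots)

    def insert(self, key, value):
        """Store the pair and return the index where it was placed."""
        m = len(self.slots)
        i = self._home(key)
        for _ in range(m):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return i
            i = (i + 1) % m  # collision: try the next index, wrapping around
        raise RuntimeError("table is full")

    def get(self, key):
        """Return the stored value, or None if the key is absent."""
        m = len(self.slots)
        i = self._home(key)
        for _ in range(m):
            if self.slots[i] is None:
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % m
        return None
```

With m = 7 and small integer keys, the keys 3, 10, and 17 all share the home index 3, so successive inserts land in consecutive slots starting there.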

Performance Analysis of Hashing Methods on the Employment of App

Yudhana, Fadlil, Prianto 2018

IJECE


The administrative process, carried out continuously, produces large amounts of data, so the search process takes a long time. Searching with hashing methods can save time. Hashing methods directly access data in a table by using the hashed key as the address in the table. The performance analysis of the hashing methods is done using 18-digit character values. The analysis is carried out on methods that have been implemented in the application. The hashing methods analyzed are progressive overflow (PO) and linear quotient (LQ). The main purpose of the performance analysis is to determine how well each method performs. The results showed that, for the average number of collisions with 15 keys, 53.3% of the analysis yielded the same value for both methods, while 46.7% showed that the linear quotient has better performance.

