BagMinHash - Minwise Hashing Algorithm for Weighted Sets

Ertl, Otmar

doi:10.1145/3219819.3220089

Cited by 36 publications

(41 citation statements)

References 33 publications

“…Finch [28] works by capturing more sketch items than strictly needed for the k-bottom sketch, then tallying them into a multiset. More theoretical studies have proposed ways to store multiplicities, including BagMinHash [29], and SuperMinHash [14]. In the future, it will be important to seek similar multiplicity-preserving extensions (and related extensions like tf-idf weighting [3,30]) for HLL as well.…”

Section: Discussion (mentioning)

confidence: 99%

Dashing: fast and accurate genomic distances with HyperLogLog

Baker

Langmead

2019

Genome Biol

Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections. Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Dashing is open source and available at https://github.com/dnbaker/dashing.
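As a rough illustration of how set similarity can be derived from cardinalities alone, the sketch below applies plain inclusion-exclusion to obtain a Jaccard coefficient. It is not Dashing's estimator: exact Python set sizes stand in for HyperLogLog estimates, and the function name `jaccard_from_cardinalities` is hypothetical.

```python
def jaccard_from_cardinalities(card_a: float, card_b: float, card_union: float) -> float:
    """Jaccard coefficient from |A|, |B| and |A ∪ B| via inclusion-exclusion.

    Assumption: the three cardinalities could come from any estimator
    (for example HyperLogLog sketches); here they are exact set sizes.
    """
    if card_union <= 0:
        return 0.0
    # |A ∩ B| = |A| + |B| - |A ∪ B|; clamp at zero because estimated
    # cardinalities can make the difference slightly negative.
    intersection = max(card_a + card_b - card_union, 0.0)
    return intersection / card_union


# Two toy "genomes" represented as sets of integer k-mer identifiers.
a = set(range(0, 1000))
b = set(range(500, 1500))
j = jaccard_from_cardinalities(len(a), len(b), len(a | b))
print(j)
```

With real sketches the union cardinality would be estimated from a merged sketch rather than from `a | b`, and the estimation error of each term carries over into the result.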

“…Finch [27] works by capturing more sketch items than strictly needed for the k-bottom sketch, then tallying them into a multiset. More theoretical studies have proposed ways to store multiplicities, including BagMinHash [28], and SuperMinHash [14]. In the future it will be important to seek similar multiplicity-preserving extensions (and related extensions like tf-idf weighting [3,29]) for HLL as well.…”

Section: Discussion (mentioning)

confidence: 99%

Dashing: Fast and Accurate Genomic Distances with HyperLogLog

Baker

Langmead

2018

Preprint

“…The CWS algorithm and its variants all have the time complexity of $O(n^{+} k)$, where $n^{+}$ is the number of elements with positive weights. Recently, Otmar [37] proposed another efficient algorithm BagMinHash for handling high dimensional vectors. BagMinHash is faster than ICWS when the vector has a large number of positive elements, e.g., $n^{+} > 1{,}000$, which may not hold for many real-world datasets.…”

Section: Related Work, 2.1 Jaccard Similarity Estimation (mentioning)

confidence: 99%

“…For task 1, we compare our method with P-MinHash [9] on probability Jaccard similarity estimation to evaluate the performance of FastGM. To highlight the efficiency of FastGM, we further compare FastGM with the state-of-the-art weighted Jaccard similarity estimation method, BagMinHash [37]. Notice that BagMinHash estimates a different metric and thus we only show results on efficiency.…”

Section: Baseline (mentioning)

confidence: 99%
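To make the remark about "a different metric" concrete, the following sketch computes both quantities exactly on a toy pair of weighted vectors. It assumes the commonly stated definitions: weighted Jaccard as the sum of element-wise minima over the sum of element-wise maxima, and probability Jaccard as the sum over shared elements $i$ of $1 / \sum_j \max(x_j/x_i, y_j/y_i)$. The function names and the dict-based vector representation are illustrative choices, not taken from either paper.

```python
from typing import Dict


def weighted_jaccard(x: Dict[str, float], y: Dict[str, float]) -> float:
    """Weighted Jaccard: sum of minima divided by sum of maxima."""
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    den = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


def probability_jaccard(x: Dict[str, float], y: Dict[str, float]) -> float:
    """Probability Jaccard (assumed definition, see lead-in)."""
    keys = set(x) | set(y)
    total = 0.0
    for i in keys:
        xi, yi = x.get(i, 0.0), y.get(i, 0.0)
        if xi <= 0 or yi <= 0:
            continue  # only elements positive in both vectors contribute
        denom = sum(max(x.get(j, 0.0) / xi, y.get(j, 0.0) / yi) for j in keys)
        total += 1.0 / denom
    return total


x = {"a": 2.0, "b": 1.0}
y = {"a": 1.0, "b": 3.0}
x_scaled = {k: 10.0 * v for k, v in x.items()}

print(weighted_jaccard(x, y), weighted_jaccard(x_scaled, y))
print(probability_jaccard(x, y), probability_jaccard(x_scaled, y))
```

Rescaling one vector changes the weighted Jaccard value but leaves the probability Jaccard value unchanged, which is one reason accuracy figures for the two estimators are not directly comparable.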

Fast Generating A Large Number of Gumbel-Max Variables

Wang

Zhang

et al. 2020

Proceedings of the Web Conference 2020

The well-known Gumbel-Max Trick for sampling elements from a categorical distribution (or more generally a nonnegative vector) and its variants have been widely used in areas such as machine learning and information retrieval. To sample a random element $i$ (or a Gumbel-Max variable $i$) in proportion to its positive weight $v_i$, the Gumbel-Max Trick first computes a Gumbel random variable $g_i$ for each positive weight element $i$, and then samples the element $i$ with the largest value of $g_i + \ln v_i$. Recently, applications including similarity estimation and graph embedding require to generate $k$ independent Gumbel-Max variables from high dimensional vectors. However, it is computationally expensive for a large $k$ (e.g., hundreds or even thousands) when using the traditional Gumbel-Max Trick. To solve this problem, we propose a novel algorithm, FastGM, that reduces the time complexity from $O(k n^{+})$ to $O(k \ln k + n^{+})$, where $n^{+}$ is the number of positive elements in the vector of interest. Instead of computing $k$ independent Gumbel random variables directly, we find that there exists a technique to generate these variables in descending order. Using this technique, our method FastGM computes variables $g_i + \ln v_i$ for all positive elements $i$ in descending order. As a result, FastGM significantly reduces the computation time because we can stop the procedure of Gumbel random variables computing for many elements especially for those with small weights. Experiments on a variety of real-world datasets show that FastGM is orders of magnitude faster than state-of-the-art methods without sacrificing accuracy and incurring additional expenses.
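The traditional Gumbel-Max Trick described in the abstract can be sketched as follows. This is the straightforward baseline that costs one Gumbel draw per positive element per sample, not FastGM's descending-order generation; the function name `gumbel_max_sample`, the dict-based weights, and the fixed seed are assumptions made for the example.

```python
import math
import random
from collections import Counter
from typing import Dict, Optional


def gumbel_max_sample(weights: Dict[str, float], rng: random.Random) -> Optional[str]:
    """Return one element sampled in proportion to its positive weight."""
    best_key, best_score = None, -math.inf
    for key, v in weights.items():
        if v <= 0:
            continue  # only positive-weight elements take part
        u = rng.random()
        while u == 0.0:  # keep u strictly inside (0, 1) so both logs are defined
            u = rng.random()
        g = -math.log(-math.log(u))  # standard Gumbel variable
        score = g + math.log(v)
        if score > best_score:
            best_key, best_score = key, score
    return best_key


rng = random.Random(0)
weights = {"a": 1.0, "b": 3.0, "c": 0.0}
k = 20000
# k independent Gumbel-Max variables: cost grows with k times the
# number of positive elements.
samples = [gumbel_max_sample(weights, rng) for _ in range(k)]
freq = Counter(samples)
print({key: count / k for key, count in freq.items()})
```

With these weights the empirical frequency of `"b"` should settle near 3/4 and `"c"` should never be selected.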
