Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2018
DOI: 10.1145/3219819.3220089

BagMinHash - Minwise Hashing Algorithm for Weighted Sets

Abstract: Minwise hashing has become a standard tool to calculate signatures which allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets is still a time-consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than the current state of the art without any particular restrictions or assumptions on weights or data dimensionality. Applied to the special case of unweighted sets, it represen…
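As a rough illustration of the unweighted case the abstract refers to, the sketch below builds a classic minwise-hashing signature and estimates the Jaccard similarity from the fraction of matching signature components. It is not BagMinHash and does not handle weights; the function names, the use of salted BLAKE2b as the hash family, and the signature length are all assumptions chosen for the example.

```python
import hashlib


def minhash_signature(items, num_hashes=128):
    """Classic (unweighted) MinHash: one minimum per seeded hash function.

    Assumes `items` is a non-empty collection of strings. Salted BLAKE2b
    stands in for a family of independent hash functions (an assumption
    made for this illustration).
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # distinct salt per hash function
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(
                    item.encode("utf-8"), digest_size=8, salt=salt
                ).digest(),
                "big",
            )
            for item in items
        )
        signature.append(smallest)
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of equal components estimates |A ∩ B| / |A ∪ B|."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    set_a = {f"x{i}" for i in range(100)}       # x0 .. x99
    set_b = {f"x{i}" for i in range(50, 150)}   # x50 .. x149
    sig_a = minhash_signature(set_a, num_hashes=256)
    sig_b = minhash_signature(set_b, num_hashes=256)
    # The exact Jaccard similarity of these two sets is 50 / 150.
    print(estimate_jaccard(sig_a, sig_b))
```

Each signature costs one hash evaluation per element per component, which is the per-element cost that faster signature algorithms aim to reduce.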

Cited by 36 publications (41 citation statements)
References 33 publications

“…Finch [28] works by capturing more sketch items than strictly needed for the k-bottom sketch, then tallying them into a multiset. More theoretical studies have proposed ways to store multiplicities, including BagMinHash [29] and SuperMinHash [14]. In the future, it will be important to seek similar multiplicity-preserving extensions (and related extensions like tf-idf weighting [3,30]) for HLL as well.…”
Section: Discussion (mentioning)
confidence: 99%
“…Finch [27] works by capturing more sketch items than strictly needed for the k-bottom sketch, then tallying them into a multiset. More theoretical studies have proposed ways to store multiplicities, including BagMinHash [28] and SuperMinHash [14]. In the future, it will be important to seek similar multiplicity-preserving extensions (and related extensions like tf-idf weighting [3,29]) for HLL as well.…”
Section: Discussion (mentioning)
confidence: 99%
“…The CWS algorithm and its variants all have the time complexity of O(n^+ k), where n^+ is the number of elements with positive weights. Recently, Otmar [37] proposed another efficient algorithm, BagMinHash, for handling high-dimensional vectors. BagMinHash is faster than ICWS when the vector has a large number of positive elements, e.g., n^+ > 1,000, which may not hold for many real-world datasets.…”
Section: Related Work 2.1 Jaccard Similarity Estimation (mentioning)
confidence: 99%
“…For task 1, we compare our method with P-MinHash [9] on probability Jaccard similarity estimation to evaluate the performance of FastGM. To highlight the efficiency of FastGM, we further compare FastGM with the state-of-the-art weighted Jaccard similarity estimation method, BagMinHash [37]. Notice that BagMinHash estimates a different metric and thus we only show results on efficiency.…”
Section: Baseline (mentioning)
confidence: 99%
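
For context on the "different metric" remark in the last statement, the two similarity measures are commonly defined as below. These are standard definitions rather than text quoted from the citing papers, and the symbols are assumptions introduced here: x and y are nonnegative weight vectors over the same index set, with 0/0 taken as 0 and at least one positive weight in each vector.

```latex
% Weighted Jaccard similarity (the quantity BagMinHash signatures estimate)
J_W(x, y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}

% Probability Jaccard similarity (the quantity P-MinHash signatures estimate)
J_P(x, y) = \sum_{i \,:\, x_i > 0,\; y_i > 0}
            \frac{1}{\sum_j \max\!\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)}
```

J_P is unchanged when either vector is rescaled by a positive constant, whereas J_W is not, which is one reason estimates of the two are not directly comparable.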