Proceedings of the 24th International Conference on World Wide Web 2015
DOI: 10.1145/2736277.2741285

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Abstract: Minwise hashing (Minhash) is a widely popular indexing scheme in practice. Minhash is designed for estimating set resemblance and is known to be suboptimal in many applications where the desired measure is set overlap (i.e., the inner product between binary vectors) or set containment. Minhash has an inherent bias towards smaller sets, which adversely affects its performance in applications where such a penalization is not desirable. In this paper, we propose asymmetric minwise hashing (MH-ALSH) to provide a solution…
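The abstract's core claim, that plain minhash is biased towards smaller sets while an asymmetric transformation ranks by overlap, can be illustrated with a short sketch. The padding scheme below follows the spirit of the paper's preprocessing (pad every indexed set to a common cardinality with dummy elements, leave the query untouched), but the function names and the dummy-element encoding are illustrative assumptions, not the paper's exact construction.

```python
import random

def minhash_signature(s, num_hashes=128, seed=0):
    # Simulate num_hashes random permutations with salted hashes and keep
    # the minimum per salt. Python's built-in hash() is consistent within
    # one process, which is enough for a demo; swap in a fixed hash such
    # as hashlib.blake2b for cross-run reproducibility.
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, e)) for e in s) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature positions estimates resemblance.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def pad_indexed_set(s, target_size, tag):
    # Asymmetric preprocessing (in the spirit of MH-ALSH): pad each indexed
    # set with dummy elements, unique to that set, until all indexed sets
    # share the same cardinality M. Queries stay unpadded, so dummies never
    # collide with query elements and the minhash collision probability
    # becomes a / (M + |q| - a), monotone in the overlap a = |x ∩ q|.
    dummies = {f"__dummy_{tag}_{i}__" for i in range(target_size - len(s))}
    return s | dummies

# Two indexed sets with identical resemblance to q but very different
# overlap: plain minhash cannot separate them; padded minhash can.
q = {"a", "b", "c"}
x_small = {"a"}                                          # overlap 1, J = 1/3
x_large = {"a", "b", "c", "d", "e", "f", "g", "h", "i"}  # overlap 3, J = 1/3

M = max(len(x_small), len(x_large))
sig_q = minhash_signature(q)
for name, x in [("x_small", x_small), ("x_large", x_large)]:
    plain = estimate_jaccard(minhash_signature(x), sig_q)
    padded = estimate_jaccard(minhash_signature(pad_indexed_set(x, M, name)), sig_q)
    print(f"{name}: plain ≈ {plain:.2f}, padded ≈ {padded:.2f}")
# Expected (up to estimation noise): both plain ≈ 0.33, while padding gives
# ≈ 1/11 for x_small and ≈ 1/3 for x_large, i.e., ranking by overlap.
```

With only 128 hashes the estimates are noisy, but the ordering is stable: after padding, the larger-overlap set wins, which is exactly the removal of the small-set penalization that the abstract describes.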

Cited by 69 publications (86 citation statements); references 25 publications.
“…As an example, weighted minwise hashing was successfully applied to train generalized min-max kernel support vector machines [18,22]. Furthermore, asymmetric locality-sensitive hashing [33], which can be realized using weighted minwise hashing, was used for efficient deep learning [34]. Finally, it could also be applied to random forests that are constructed using the weighted Jaccard index as similarity measure [30].…”
Section: Applications
confidence: 99%
“…WHIMP gets a precision and recall above 0.7 for at least 75% of the sample. We stress the low values of cosine similarity here: a similarity of 0.2 is well below the values studied in recent LSH-based results [37,39,38]. It is well known that low similarity values are harder to detect, yet WHIMP gets accurate results for an overwhelming majority of the vertices/users.…”
Section: Results
confidence: 67%
“…Compared with symmetric similarity measures such as Jaccard similarity, containment similarity gives special consideration to the query size, which makes it more suitable in some applications. As shown in [35], containment similarity is useful in record-matching applications. Consider two text descriptions of restaurants X and Y, represented by the two "set of words" records {five, guys, burgers, and, fries, downtown, brooklyn, new, york} and {five, kitchen, berkeley}, respectively.…”
Section: Introduction
confidence: 99%
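The restaurant example above reduces to a two-line computation. The sketch below (with a hypothetical query, since the citing paper does not give one) shows how the containment denominator |Q| changes the ranking relative to Jaccard.

```python
def jaccard(x, q):
    # Symmetric resemblance: |X ∩ Q| / |X ∪ Q|.
    return len(x & q) / len(x | q)

def containment(x, q):
    # Containment of the query in the record: |X ∩ Q| / |Q|.
    # The denominator depends only on the query, so long records
    # are not penalized for their extra elements.
    return len(x & q) / len(q)

x = {"five", "guys", "burgers", "and", "fries",
     "downtown", "brooklyn", "new", "york"}
y = {"five", "kitchen", "berkeley"}
q = {"five", "guys", "burgers"}  # hypothetical query for illustration

print(jaccard(x, q), containment(x, q))  # 3/9 ≈ 0.33 vs 3/3 = 1.00
print(jaccard(y, q), containment(y, q))  # 1/5 = 0.20 vs 1/3 ≈ 0.33
```

Jaccard penalizes record x for its nine elements even though it fully covers the query; containment rates it 1.0 and keeps the query size as the only denominator, which is the property the citing paper highlights.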
“…Challenges. The problem of containment similarity search has been intensively studied in the literature in recent years (e.g., [5], [35], [44]). The key challenges of this problem come from the following three aspects: (i) The number of elements (i.e., vocabulary size) may be very large.…”
Section: Introduction
confidence: 99%