Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems 2018
DOI: 10.1145/3196959.3196976
Distance-Sensitive Hashing

Abstract: Locality-sensitive hashing (LSH) is an important tool for managing high-dimensional noisy or uncertain data, for example in connection with data cleaning (similarity join) and noise-robust search (similarity search). However, for a number of problems the LSH framework is not known to yield good solutions, and instead ad hoc solutions have been designed for particular similarity and distance measures. For example, this is true for output-sensitive similarity search/join, and for indexes supporting annulus queries…

Cited by 16 publications (12 citation statements)
References 52 publications
“…Approximating Log-convex Functions via Distance Sensitive Hashing. We show that this is indeed possible for log-convex functions of the inner product by utilizing a family of hashing schemes introduced recently by Aumüller et al. [15], referred to as Distance Sensitive Hashing (DSH). This family is defined through two parameters γ ≥ 0 and s > 0, with collision probability p_{γ,s}(ρ) having the following dependence on the inner product ρ = ⟨x, y⟩ between two vectors x, y ∈ S^{d−1}…”
Section: Do There Exist Design Principles For {W_T} and {P_T} That…
confidence: 87%
“…To analyze the collision probability of the DSH scheme we closely follow the proof of Aumüller et al. [15], with the difference that we use Proposition 9.2 to bound bivariate Gaussian integrals.…”
Section: Distance Sensitive Hashing On the Unit Sphere
confidence: 99%
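The quoted passages describe hash families whose collision probability is a prescribed function of the inner product between unit vectors. As a minimal, self-contained illustration of that idea — using the classic random-hyperplane (SimHash) family, not the DSH construction of Aumüller et al. — the sketch below estimates the collision probability empirically and compares it against the known closed form Pr[h(x) = h(y)] = 1 − arccos(ρ)/π. Distance-sensitive hashing generalizes this picture by letting the collision probability be an (almost) arbitrary function of the distance.

```python
# Illustration only: random-hyperplane (SimHash) hashing, whose collision
# probability is a fixed function of the inner product rho of unit vectors:
#   Pr[h(x) = h(y)] = 1 - arccos(rho) / pi
# This is NOT the DSH family of Aumüller et al. [15]; it just shows the
# "collision probability as a function of similarity" phenomenon.
import math
import random

def simhash_bit(v, r):
    """One hash bit: the sign of <v, r> for a random Gaussian vector r."""
    return sum(vi * ri for vi, ri in zip(v, r)) >= 0.0

def empirical_collision_prob(x, y, trials=20000, seed=0):
    """Estimate Pr[h(x) = h(y)] over `trials` random hyperplanes."""
    rng = random.Random(seed)
    d = len(x)
    hits = 0
    for _ in range(trials):
        r = [rng.gauss(0.0, 1.0) for _ in range(d)]
        if simhash_bit(x, r) == simhash_bit(y, r):
            hits += 1
    return hits / trials

# Two unit vectors with inner product rho = cos(60 degrees) = 0.5.
x = [1.0, 0.0]
y = [0.5, math.sqrt(3) / 2]
p_hat = empirical_collision_prob(x, y)          # close to 2/3
p_theory = 1 - math.acos(0.5) / math.pi          # exactly 2/3
```

With 20,000 sampled hyperplanes the empirical estimate concentrates tightly around the theoretical value 2/3, making the functional dependence of collisions on ρ directly observable.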
“…An interesting open question is to investigate the applicability of our data structures for problems like discrimination discovery [56], diversity in recommender systems [3], privacy-preserving similarity search [65], and estimation of kernel density [22]. Moreover, it would be interesting to investigate techniques for providing incentives (i.e., reverse discrimination [56]) to prevent discrimination: an idea could be to merge the data structures in this paper with the distance-sensitive hashing functions in [13], which allow one to implement hashing schemes where the collision probability is an (almost) arbitrary function of the distance. Further, the techniques presented here require a manual trade-off between the performance of the LSH part and the additional running-time contribution from finding the near points among the non-far points.…”
Section: Discussion
confidence: 99%
“…More recently, researchers have begun leveraging LSH techniques to solve problems beyond ANN, extending their domain to applications around density estimation for high-dimensional models. For example, [ACPS17] generalizes nearest-neighbor LSH hash functions to be sensitive to custom distance ranges. [AAP17] builds many different parameterized versions of the prototypical LSH hash tables and adaptively probes them for spherical range reporting.…”
Section: Related Literature
confidence: 99%