2021
DOI: 10.1007/s00778-021-00680-7

PM-LSH: a fast and accurate in-memory framework for high-dimensional approximate NN and closest pair search

Abstract: Nearest neighbor (NN) search is inherently computationally expensive in high-dimensional spaces due to the curse of dimensionality. As a well-known solution, locality-sensitive hashing (LSH) is able to answer c-approximate NN (c-ANN) queries in sublinear time with constant probability. Existing LSH methods focus mainly on building hash bucket-based indexing such that the candidate points can be retrieved quickly. However, existing coarse-grained structures fail to offer accurate distance estimation for candida…

Cited by 22 publications (33 citation statements)
References 49 publications
Citation types: 3 supporting, 30 mentioning, 0 contrasting
“…For 1000 randomly chosen query points, we report the final radius values (using the Virtual Rehashing technique from C2LSH [13] and QALSH [16]) for the top-100 points. This observation was also noted by a very recent paper [43], where the authors show that the distance distributions of data points in different high-dimensional datasets are highly homogeneous. By leveraging this simple observation, we design an improved, simple, and effective Virtual Rehashing technique: we execute a sample set of randomly chosen queries for a given k and count the number of occurrences of the final radius value.…”
Section: Improved Virtual Rehashing Strategy (supporting)
confidence: 75%
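The strategy quoted above is concrete enough to illustrate. Below is a minimal Python sketch of the idea, assuming Euclidean distance and the geometric radius schedule r = c^i used by C2LSH; the helper names final_radius and estimate_start_radius are hypothetical, not taken from the cited paper.

```python
import numpy as np
from collections import Counter

def final_radius(data, query, k, c=2.0):
    """Simulate C2LSH-style virtual rehashing for one query: grow the
    search radius geometrically (r = c^i) until it covers the k-th
    nearest neighbor, and return that final radius."""
    dists = np.linalg.norm(data - query, axis=1)
    kth = np.partition(dists, k - 1)[k - 1]  # distance to the k-th NN
    r = 1.0
    while r < kth:                           # virtual rehashing rounds
        r *= c
    return r

def estimate_start_radius(data, k, n_samples=1000, seed=0):
    """Run a sample of random queries, count how often each final
    radius occurs, and return the most frequent one."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=min(n_samples, len(data)), replace=False)
    counts = Counter(final_radius(data, q, k) for q in data[idx])
    return counts.most_common(1)[0][0]
```

Starting every subsequent query at the most frequent final radius lets the search skip the early rehashing rounds that, on datasets with highly homogeneous distance distributions, almost never return enough candidates.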
“…Recently, HD-Index [1] was introduced, which generates Hilbert keys for the dataset objects and also stores the objects' distances to each other so that results can be pruned efficiently with distance filters. Very recently, PM-LSH [43] was proposed, where the idea is to estimate the Euclidean distance based on a tunable confidence interval so as to reduce the overall query processing time. Query Workloads in High-Dimensional Spaces: Until now, only two works [30,17] have focused on the efficient execution of query workloads in high-dimensional spaces.…”
Section: Related Work (mentioning)
confidence: 99%
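For context, the distance-estimation idea attributed to PM-LSH can be sketched as follows. The sketch assumes Gaussian random projections, under which the squared projected distance divided by the squared true distance follows a chi-squared distribution with m degrees of freedom; the function name distance_interval and all parameter values are illustrative, not the paper's API.

```python
import numpy as np
from scipy.stats import chi2

d, m = 128, 16             # original / projected dimensionality (illustrative)
A = np.random.randn(m, d)  # Gaussian random projection matrix

def distance_interval(o, q, alpha=0.05):
    """Return a (1 - alpha) confidence interval for ||o - q|| computed
    from the projected distance alone: ||A(o-q)||^2 / ||o-q||^2 follows
    a chi-squared distribution with m degrees of freedom."""
    proj_sq = np.sum((A @ (o - q)) ** 2)
    lo = np.sqrt(proj_sq / chi2.ppf(1 - alpha / 2, df=m))
    hi = np.sqrt(proj_sq / chi2.ppf(alpha / 2, df=m))
    return lo, hi
```

A candidate whose lower bound already exceeds the distance of the current k-th best result can be discarded without computing its exact distance, which is how a tunable confidence interval translates into reduced query processing time.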
“…While it offers update throughput and search performance comparable to our system, it ends up needing 25X more machines due to its high RAM consumption. A similar issue arises with PM-LSH, another state-of-the-art LSH-based system [62], whose memory footprint is somewhat lower than PLSH's (because it uses fewer LSH tables) but whose query latencies are an order of magnitude slower than both our system and PLSH. Alternately, disk-based LSH indices such as SRS [53] can host a billion-point index on a single machine, but their query latencies are extremely slow, with the system fetching around 15% of the total index (running into GBs per query) from disk to achieve good accuracy.…”
Section: Shortcoming of Existing Algorithms (supporting)
confidence: 66%
“…In a breakthrough result, Indyk and Motwani [32] show that a class of algorithms known as locality-sensitive hashing can yield provably approximate solutions to the ANNS problem with a polynomially sized index and sublinear query time. Subsequent to this work, a plethora of LSH-based algorithms has appeared [3,32,62], including ones that depend on the data [4], use spectral methods [61], or distribute the LSH index [54]. While the advantage of the simpler data-independent hashing methods is that updates are almost trivial, the indices are often entirely resident in DRAM and hence do not scale very well.…”
Section: Related Work (mentioning)
confidence: 99%
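As a concrete illustration of the data-independent family discussed in this passage, the following minimal sketch implements a single Euclidean LSH table in the style of the h(x) = ⌊(a·x + b)/w⌋ family of Datar et al.; the class name E2LSHTable and its parameters are illustrative.

```python
import numpy as np

class E2LSHTable:
    """One data-independent LSH table for Euclidean space (sketch)."""
    def __init__(self, dim, n_hashes=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(n_hashes, dim))   # Gaussian directions
        self.b = rng.uniform(0.0, w, size=n_hashes) # random offsets
        self.w = w
        self.buckets = {}

    def _key(self, x):
        return tuple(np.floor((self.a @ x + self.b) / self.w).astype(int))

    def insert(self, idx, x):
        self.buckets.setdefault(self._key(x), []).append(idx)

    def query(self, q):
        # Candidates are the points colliding with q in this table;
        # in practice several tables are combined to boost recall.
        return self.buckets.get(self._key(q), [])
```

Insertion is one hash computation plus an append, which illustrates why updates are almost trivial for data-independent methods; the flip side, as the passage notes, is that the bucket tables typically live entirely in DRAM.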
“…We choose HNSW [28] under two considerations: (1) HNSW is a k-nearest neighbor graph (k-NNG) based method [9,28,11,38], which is very efficient for k-NNS [3]. (2) Compared with methods based on locality-sensitive hashing (LSH) [18,7,12,33,15,14,23,27,39,24,16] and product quantization [19,20,34], the HNSW graph G_T directly stores the kNNs N_T(t) of t. Thus, we can retrieve N_T(t) for all t ∈ P(T) without conducting k-NNS again in the query phase. Notice that HNSW uses a priority queue to perform k-NNS.…”
Section: Pre-processing Phase (mentioning)
confidence: 99%
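A minimal sketch of the pre-processing step described here, using the hnswlib package (an assumption; the cited work may rely on its own HNSW implementation): build the index once over the point set T, then fetch N_T(t) for every t in a single batched query so that no per-point k-NNS is needed later.

```python
import numpy as np
import hnswlib  # assumed available: pip install hnswlib

# Toy point set standing in for T; sizes and dimensions are illustrative.
T = np.random.rand(10_000, 64).astype(np.float32)
k = 10

index = hnswlib.Index(space='l2', dim=T.shape[1])
index.init_index(max_elements=T.shape[0], ef_construction=200, M=16)
index.add_items(T, np.arange(T.shape[0]))
index.set_ef(max(50, k))  # size of the search priority queue; must be >= k

# labels[i] lists the k nearest neighbors of T[i]; each point appears in
# its own list at distance 0 and may need to be dropped by the caller.
labels, distances = index.knn_query(T, k=k)
```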