2021
DOI: 10.1007/s00778-021-00680-7

PM-LSH: a fast and accurate in-memory framework for high-dimensional approximate NN and closest pair search

Abstract: Nearest neighbor (NN) search is inherently computationally expensive in high-dimensional spaces due to the curse of dimensionality. As a well-known solution, locality-sensitive hashing (LSH) is able to answer c-approximate NN (c-ANN) queries in sublinear time with constant probability. Existing LSH methods focus mainly on building hash bucket-based indexing such that the candidate points can be retrieved quickly. However, existing coarse-grained structures fail to offer accurate distance estimation for candida…

Cited by 22 publications (33 citation statements)
References 49 publications
Citation types: 3 supporting, 30 mentioning, 0 contrasting
“…For 1000 randomly chosen query points, we report the final radius values (using the Virtual Rehashing technique from C2LSH [13] and QALSH [16]) for the top-100 points. This observation was also noted by a very recent paper [43], where the authors show that the distance distributions of data points in different high-dimensional datasets are highly homogeneous. By leveraging this simple observation, we design an improved, simple, and effective Virtual Rehashing technique: we execute a sample set of randomly chosen queries for a given k and count the number of occurrences of the final radius value.…”
Section: Improved Virtual Rehashing Strategy (supporting)
confidence: 75%
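The strategy quoted above is concrete enough to illustrate. Below is a minimal Python sketch of the idea, assuming Euclidean distance and the geometric radius schedule r = c^i used by C2LSH; the helper names final_radius and estimate_start_radius are hypothetical, not taken from the cited paper.

```python
import numpy as np
from collections import Counter

def final_radius(data, query, k, c=2.0):
    """Simulate C2LSH-style virtual rehashing for one query: grow the
    search radius geometrically (r = c^i) until it covers the k-th
    nearest neighbor, and return that final radius."""
    dists = np.linalg.norm(data - query, axis=1)
    kth = np.partition(dists, k - 1)[k - 1]  # distance to the k-th NN
    r = 1.0
    while r < kth:                           # virtual rehashing rounds
        r *= c
    return r

def estimate_start_radius(data, k, n_samples=1000, seed=0):
    """Run a sample of random queries, count how often each final
    radius occurs, and return the most frequent one."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=min(n_samples, len(data)), replace=False)
    counts = Counter(final_radius(data, q, k) for q in data[idx])
    return counts.most_common(1)[0][0]
```

Starting every subsequent query at the most frequent final radius lets the search skip the early rehashing rounds that, on datasets with highly homogeneous distance distributions, almost never return enough candidates.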
“…Recently, HD-Index [1] was introduced, which generates Hilbert keys for the dataset objects and also stores the objects' distances to each other so that results can be pruned efficiently with distance filters. Very recently, PM-LSH [43] was proposed, where the idea is to estimate the Euclidean distance based on a tunable confidence interval so as to reduce the overall query processing time. Query Workloads in High-Dimensional Spaces: Until now, only two works [30,17] have focused on the efficient execution of query workloads in high-dimensional spaces.…”
Section: Related Work (mentioning)
confidence: 99%
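For context, the distance-estimation idea attributed to PM-LSH can be sketched as follows. The sketch assumes Gaussian random projections, under which the squared projected distance divided by the squared true distance follows a chi-squared distribution with m degrees of freedom; the function name distance_interval and all parameter values are illustrative, not the paper's API.

```python
import numpy as np
from scipy.stats import chi2

d, m = 128, 16             # original / projected dimensionality (illustrative)
A = np.random.randn(m, d)  # Gaussian random projection matrix

def distance_interval(o, q, alpha=0.05):
    """Return a (1 - alpha) confidence interval for ||o - q|| computed
    from the projected distance alone: ||A(o-q)||^2 / ||o-q||^2 follows
    a chi-squared distribution with m degrees of freedom."""
    proj_sq = np.sum((A @ (o - q)) ** 2)
    lo = np.sqrt(proj_sq / chi2.ppf(1 - alpha / 2, df=m))
    hi = np.sqrt(proj_sq / chi2.ppf(alpha / 2, df=m))
    return lo, hi
```

A candidate whose lower bound already exceeds the distance of the current k-th best result can be discarded without computing its exact distance, which is how a tunable confidence interval translates into reduced query processing time.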
“…While it offers update throughput and search performance comparable to our system, it ends up needing 25X more machines due to its high RAM consumption. A similar issue arises with PM-LSH, another state-of-the-art LSH-based system [62], whose memory footprint is somewhat lower than PLSH's (because it uses fewer LSH tables) but whose query latencies are an order of magnitude slower than both our system and PLSH. Alternately, disk-based LSH indices such as SRS [53] can host a billion-point index on a single machine, but their query latencies are extremely slow, with the system fetching around 15% of the total index (running into GBs per query) from disk to achieve good accuracy.…”
Section: Shortcoming of Existing Algorithms (supporting)
confidence: 66%
“…In a breakthrough result, Indyk and Motwani [32] show that a class of algorithms known as locality-sensitive hashing can yield provably approximate solutions to the ANNS problem with a polynomially sized index and sublinear query time. Subsequent to this work, a plethora of LSH-based algorithms has appeared [3,32,62], including ones that depend on the data [4], use spectral methods [61], or distribute the LSH index [54]. While the advantage of the simpler data-independent hashing methods is that updates are almost trivial, the indices are often entirely resident in DRAM and hence do not scale very well.…”
Section: Related Work (mentioning)
confidence: 99%
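As a concrete illustration of the data-independent family discussed in this passage, the following minimal sketch implements a single Euclidean LSH table in the style of the h(x) = ⌊(a·x + b)/w⌋ family of Datar et al.; the class name E2LSHTable and its parameters are illustrative.

```python
import numpy as np

class E2LSHTable:
    """One data-independent LSH table for Euclidean space (sketch)."""
    def __init__(self, dim, n_hashes=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(n_hashes, dim))   # Gaussian directions
        self.b = rng.uniform(0.0, w, size=n_hashes) # random offsets
        self.w = w
        self.buckets = {}

    def _key(self, x):
        return tuple(np.floor((self.a @ x + self.b) / self.w).astype(int))

    def insert(self, idx, x):
        self.buckets.setdefault(self._key(x), []).append(idx)

    def query(self, q):
        # Candidates are the points colliding with q in this table;
        # in practice several tables are combined to boost recall.
        return self.buckets.get(self._key(q), [])
```

Insertion is one hash computation plus an append, which illustrates why updates are almost trivial for data-independent methods; the flip side, as the passage notes, is that the bucket tables typically live entirely in DRAM.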
“…We choose HNSW [28] under two considerations: (1) HNSW is a k-nearest neighbor graph (k-NNG) based method [9,28,11,38], which is very efficient for k-NNS [3]. (2) Compared with methods based on locality-sensitive hashing (LSH) [18,7,12,33,15,14,23,27,39,24,16] and product quantization [19,20,34], the HNSW graph G_T directly stores the kNNs N_T(t) of t. Thus, we can retrieve N_T(t) for all t ∈ P(T) without conducting k-NNS again in the query phase. Notice that HNSW uses a priority queue to perform k-NNS.…”
Section: Pre-processing Phase (mentioning)
confidence: 99%
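A minimal sketch of the pre-processing step described here, using the hnswlib package (an assumption; the cited work may rely on its own HNSW implementation): build the index once over the point set T, then fetch N_T(t) for every t in a single batched query so that no per-point k-NNS is needed later.

```python
import numpy as np
import hnswlib  # assumed available: pip install hnswlib

# Toy point set standing in for T; sizes and dimensions are illustrative.
T = np.random.rand(10_000, 64).astype(np.float32)
k = 10

index = hnswlib.Index(space='l2', dim=T.shape[1])
index.init_index(max_elements=T.shape[0], ef_construction=200, M=16)
index.add_items(T, np.arange(T.shape[0]))
index.set_ef(max(50, k))  # size of the search priority queue; must be >= k

# labels[i] lists the k nearest neighbors of T[i]; each point appears in
# its own list at distance 0 and may need to be dropped by the caller.
labels, distances = index.knn_query(T, k=k)
```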