Given two sets A and B of multidimensional objects, the all-nearest-neighbors (ANN) query retrieves for each object in A its nearest neighbor in B. Although this operation is common in several applications, it has not received much attention in the database literature. In this paper we study alternative methods for processing ANN queries depending on whether A and B are indexed. Our algorithms are evaluated through extensive experimentation using synthetic and real datasets. The performance studies show that they are an order of magnitude faster than a previous approach based on closest-pairs query processing.
1 Introduction
Let A and B be two spatial datasets and dist(p,q) be a distance metric. Then, the all-nearest-neighbors query is defined as: $ANN(A,B) = \{\langle a,b \rangle : a \in A,\ b \in B,\ \forall b' \in B,\ dist(a,b) \le dist(a,b')\}$. In this paper, we propose novel techniques for general ANN query processing. Following the common trend in the literature, we assume that the underlying indexes (whenever available) are R-trees [Gut84, BKSS90]. Although, for simplicity, we deal with points and use Euclidean distance, extensions to other data partition access methods, extended objects and other distance metrics are straightforward.
The rest of the paper is organized as follows. Section 2 discusses previous work directly related to the ANN problem. Sections 3 and 4 present algorithms for different cases, based on whether A, B, or both are indexed. Section 5 experimentally evaluates the algorithms, and section 6 concludes the paper with a discussion.
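As a point of reference for the ANN definition given in the introduction, the query can be written as a brute-force scan that compares every point of A against every point of B. This is only a minimal sketch of the query semantics under the paper's simplifying assumptions (points, Euclidean distance); it is not one of the algorithms proposed in the paper, and the function names all_nearest_neighbors and dist are chosen here for illustration.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.dist(p, q)


def all_nearest_neighbors(A, B):
    """Brute-force ANN: pair each a in A with one nearest neighbor b in B.

    Costs |A| * |B| distance computations; ties are broken by the
    order of B. Illustrates the query semantics only.
    """
    if not B:
        raise ValueError("B must contain at least one point")
    return [(a, min(B, key=lambda b: dist(a, b))) for a in A]


# Hypothetical toy datasets.
A = [(0.0, 0.0), (5.0, 5.0)]
B = [(1.0, 0.0), (4.0, 4.0), (10.0, 10.0)]
pairs = all_nearest_neighbors(A, B)
for a, b in pairs:
    print(a, "->", b)
```

Note that the result is not symmetric: ANN(A,B) pairs every point of A with a point of B, whereas ANN(B,A) would in general return different pairs.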
2 Related work
ANN queries constitute a hybrid of nearest neighbor search and spatial joins; therefore, in sections 2.1 and 2.2 we review related work for these query types focusing more on the processing techniques that are also employed by our algorithms. Section 2.3 describes methods for closest-pair queries, and section 2.4 discusses existing techniques for ANN query processing.
2.1 Nearest neighbor queries
The goal of nearest neighbor (NN) search is to find the objects in a dataset A that are closest to a query point q. Existing algorithms presume that the dataset is indexed by an R-tree and use various metrics to prune the search space: mindist(q,M) is the minimum distance between q and any point in a minimum bounding rectangle (MBR) M. The algorithm of [RKV95] traverses the tree in a depth-first (DF) manner. Assume that we search for the nearest neighbor NN(q,R) of q in R-tree R. Starting from the root, all entries are sorted according to their mindist from q, and the entry with the smallest mindist is visited first. The process is repeated recursively until the leaf level, where a potential nearest neighbor is found. During backtracking to the upper levels, the algorithm only visits entries whose mindist is smaller than the distance of the nearest neighbor found so far. As an example consider the R-tree of Figure 1, where the number in each entry refers to the mindist (for intermediate entries) or the actual distance (for leaf entries, i.e., objects) from q (these numbers are not stored but computed dynamically during query processing). DF would...
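The mindist metric and the depth-first traversal described in this subsection can be sketched as follows. This is a minimal illustration, not the implementation of [RKV95]: the Node class is a hypothetical in-memory stand-in for an R-tree node, the tree is built by hand rather than by R-tree insertion, and the data do not correspond to Figure 1. The optional visited list is added here only to make the pruning observable.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical R-tree node: an MBR with child nodes or data points."""
    lo: tuple                                     # lower corner of the MBR
    hi: tuple                                     # upper corner of the MBR
    children: list = field(default_factory=list)  # used by intermediate nodes
    points: list = field(default_factory=list)    # used by leaf nodes


def mindist(q, lo, hi):
    """Minimum Euclidean distance between point q and the MBR [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))


def df_nn(q, node, best=(math.inf, None), visited=None):
    """Depth-first NN search; returns (distance, point) of the best match."""
    if visited is not None:
        visited.append(node)
    if not node.children:                         # leaf: check actual distances
        for p in node.points:
            d = math.dist(q, p)
            if d < best[0]:
                best = (d, p)
        return best
    # Visit entries in ascending mindist order; during backtracking, skip
    # any entry whose mindist is not smaller than the best distance so far.
    for child in sorted(node.children, key=lambda c: mindist(q, c.lo, c.hi)):
        if mindist(q, child.lo, child.hi) < best[0]:
            best = df_nn(q, child, best, visited)
    return best


# Hand-built two-level tree (hypothetical data).
leaf1 = Node((1.0, 1.0), (2.0, 3.0), points=[(1.0, 1.0), (2.0, 3.0)])
leaf2 = Node((6.0, 6.0), (7.0, 8.0), points=[(6.0, 6.0), (7.0, 8.0)])
leaf3 = Node((4.0, 1.0), (5.0, 2.0), points=[(4.0, 1.0), (5.0, 2.0)])
root = Node((1.0, 1.0), (7.0, 8.0), children=[leaf1, leaf2, leaf3])

q = (4.5, 3.0)
visited = []
nn_dist, nn_point = df_nn(q, root, visited=visited)
print(nn_point, round(nn_dist, 3), "nodes visited:", len(visited))
```

In this toy tree the leaf with the smallest mindist is visited first and yields a neighbor at a distance smaller than the mindist of the two remaining leaves, so both are pruned during backtracking and only the root and one leaf are accessed.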