, or permissions@acm.org.
· S. Arya, et al.Consider a set S of n data points in real d-dimensional space, R d , where distances are measured using any Minkowski metric. In nearest neighbor searching we preprocess S into a data structure, so that given any query point q ∈ R d , the closest point of S to q can be reported quickly. Given any positive real , a data point p is a (1 + )-approximate nearest neighbor of q if its distance from q is within a factor of (1 + ) of the distance to the true nearest neighbor. We show that it is possible to preprocess a set of n points in R d in O(dn log n) time and O(dn) space, so that given a query point q ∈ R d , and > 0, a (1 + )-approximate nearest neighbor of q can be computed ind is a factor depending only on dimension and . In general, we show that given an integer k ≥ 1, (1 + )-approximations to the k nearest neighbors of q can be computed in additional O(kd log n) time.
In k-means clustering we are given a set of n data points in d-dimensional space d and an integer k, and the problem is to determine a set of k points in d , called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9 + ε)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9 − ε) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.