DBSCAN is a method proposed in 1996 for clustering multi-dimensional points, and has received extensive applications. Its computational hardness is still unsolved to this date. The original KDD‚96 paper claimed an algorithm of O ( n log n ) ”average runtime complexity„ (where n is the number of data points) without a rigorous proof. In 2013, a genuine O ( n log n )-time algorithm was found in 2D space under Euclidean distance. The hardness of dimensionality d ≥3 has remained open ever since. This article considers the problem of computing DBSCAN clusters from scratch (assuming no existing indexes) under Euclidean distance. We prove that, for d ≥3, the problem requires ω( n 4/3 ) time to solve, unless very significant breakthroughs—ones widely believed to be impossible—could be made in theoretical computer science. Motivated by this, we propose a relaxed version of the problem called ρ- approximate DBSCAN , which returns the same clusters as DBSCAN, unless the clusters are ”unstable„ (i.e., they change once the input parameters are slightly perturbed). The ρ-approximate problem can be settled in O ( n ) expected time regardless of the constant dimensionality d . The article also enhances the previous result on the exact DBSCAN problem in 2D space. We show that, if the n data points have been pre-sorted on each dimension (i.e., one sorted list per dimension), the problem can be settled in O ( n ) worst-case time. As a corollary, when all the coordinates are integers, the 2D DBSCAN problem can be solved in O ( n log log n ) time deterministically, improving the existing O ( n log n ) bound.
Dynamic clustering-how to efficiently maintain data clusters along with updates in the underlying dataset-is a difficult topic. This is especially true for density-based clustering, where objects are aggregated based on transitivity of proximity, under which deciding the cluster(s) of an object may require the inspection of numerous other objects. The phenomenon is unfortunate, given the popular usage of this clustering approach in many applications demanding data updates.Motivated by the above, we investigate the algorithmic principles for dynamic clustering by DBSCAN, a successful representative of density-based clustering, and ρ-approximate DBSCAN, proposed to bring down the computational hardness of the former on static data. Surprisingly, we prove that the ρ-approximate version suffers from the very same hardness when the dataset is fully dynamic, namely, when both insertions and deletions are allowed. We also show that this issue goes away as soon as tiny further relaxation is applied, yet still ensuring the same quality-known as the "sandwich guarantee"-of ρ-approximate DBSCAN. Our algorithms guarantee near-constant update processing, and outperform existing approaches by a factor over two orders of magnitude.
DBSCAN is a popular method for clustering multi-dimensional objects. Just as notable as the method's vast success is the research community's quest for its efficient computation. The original KDD'96 paper claimed an algorithm with O(n log n) running time, where n is the number of objects. Unfortunately, this is a mis-claim; and that algorithm actually requires O(n 2 ) time. There has been a fix in 2D space, where a genuine O(n log n)-time algorithm has been found. Looking for a fix for dimensionality d ≥ 3 is currently an important open problem.In this paper, we prove that for d ≥ 3, the DBSCAN problem requires Ω(n 4/3 ) time to solve, unless very significant breakthroughs-ones widely believed to be impossible-could be made in theoretical computer science. This (i) explains why the community's search for fixing the aforementioned mis-claim has been futile for d ≥ 3, and (ii) indicates (sadly) that all DBSCAN algorithms must be intolerably slow even on moderately large n in practice. Surprisingly, we show that the running time can be dramatically brought down to O(n) in expectation regardless of the dimensionality d, as soon as slight inaccuracy in the clustering results is permitted. We formalize our findings into the new notion of ρ-approximate DBSCAN, which we believe should replace DBSCAN on big data due to the latter's computational intractability.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.