Hierarchical clustering is extensively used to organize high-dimensional objects such as documents and images into a structure that can then be used in a multitude of ways. However, existing algorithms are limited in their application, since the time complexity of agglomerative-style algorithms can be as much as O(n² log n), where n is the number of objects. Furthermore, computing the similarity between such objects is itself time-consuming because they are high-dimensional, and even the optimized built-in functions found in MATLAB take the best part of a day to handle collections of just 10,000 objects on typical machines. In this paper we explore using angular hashing to hash objects with similar angular distance to the same hash bucket. This allows us to create hierarchies of objects within each hash bucket and to hierarchically cluster the hash buckets themselves. With our formal guarantees on the similarity of objects in the same bucket, this leads to an elegant agglomerative algorithm with strong performance bounds. Our experimental results show that not only is our approach thousands of times faster than regular agglomerative algorithms but, surprisingly, the accuracy of our results is typically as good and can sometimes be substantially better.
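The bucketing step described in this abstract can be illustrated with a minimal sketch. The abstract does not say which angular hash family the authors use, so the sketch assumes the common random-hyperplane (sign random projection) scheme, in which each hash bit records the side of a random hyperplane a vector falls on. The names `make_hyperplanes`, `angular_hash`, and `bucketize` are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian normal vectors of length dim (assumed scheme)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def angular_hash(vector, hyperplanes):
    """Return one bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0 else 0
        for h in hyperplanes
    )


def bucketize(vectors, hyperplanes):
    """Group vector indices by hash signature; vectors at a small angle tend to collide."""
    buckets = defaultdict(list)
    for index, vector in enumerate(vectors):
        buckets[angular_hash(vector, hyperplanes)].append(index)
    return dict(buckets)


if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=8, dim=3, seed=1)
    data = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]
    print(bucketize(data, planes))
```

Under this assumed scheme the signature depends only on a vector's direction, so a vector and any positive multiple of it always share a bucket. An agglomerative algorithm could then be run inside each bucket and over the buckets themselves, as the abstract describes; that step is omitted here.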
The area of constrained clustering has been actively pursued for the last decade. A more recent extension, which is the focus of this paper, is constrained hierarchical clustering, which allows building user-constrained dendrograms/trees. Like all forms of constrained clustering, previous work on hierarchical constrained clustering uses simple constraints that are typically implemented in a procedural language. However, there exist mature results and packages in the fields of constraint satisfaction languages and solvers that the constrained clustering field has yet to explore. This work marks the first steps towards introducing constraint satisfaction languages/solvers into hierarchical constrained clustering. We make several significant contributions. We show how many existing and new constraints for hierarchical clustering can be modeled as a Horn-SAT problem that is easily solvable in polynomial time and which allows their implementation in any number of declarative languages or efficient solvers. We implement our own solver for efficiency reasons. We then show how to formulate constrained hierarchical clustering in a flexible manner so that any number of algorithms whose output is a dendrogram can make use of the constraints.
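To show why Horn-SAT is solvable in polynomial time, here is a minimal forward-chaining sketch. It is not the authors' solver, and the abstract does not describe how clustering constraints are encoded as clauses, so the variables below are generic placeholders. Each clause is assumed to be a pair `(body, head)` meaning "if every variable in `body` is true, then `head` is true", with `head=None` standing for a clause that forbids its body from being all true. The name `solve_horn` is hypothetical.

```python
def solve_horn(clauses):
    """Return the minimal set of true variables satisfying the Horn clauses,
    or None if the clauses are unsatisfiable.

    clauses: iterable of (body, head) pairs; body is a tuple of variable names,
    head is a variable name or None (meaning the body must not all be true).
    """
    clauses = list(clauses)
    true_vars = set()
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            # A clause fires once every variable in its body has been derived.
            if all(var in true_vars for var in body):
                if head is None:
                    return None  # forced variables violate a negative clause
                if head not in true_vars:
                    true_vars.add(head)
                    changed = True
    return true_vars


if __name__ == "__main__":
    satisfiable = [((), "x"), (("x",), "y"), (("y", "z"), "w")]
    print(solve_horn(satisfiable))
    contradictory = [((), "x"), (("x",), None)]
    print(solve_horn(contradictory))
```

Each pass over the clauses either derives at least one new variable or stops, so the loop runs at most once per variable plus one final pass. This simple version is therefore quadratic in the worst case; linear-time Horn-SAT algorithms exist but are longer to write down.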