An Algorithm for Online K-Means Clustering

Liberty, Edo; Sriharsha, Ram; Sviridenko, Maxim

doi:10.1137/1.9781611974317.7

Cited by 65 publications

(53 citation statements)

References 21 publications


“…Grinch makes heavy use of nearest neighbor search under the linkage function f. Rather than perform nearest neighbor search anew for each graft, when a data point arrives, we perform a single k-NN search (k ∈ [25, 50]) and only consider these nodes during subsequent grafts (until the next data point arrives). Ablation.…”

mentioning

confidence: 99%

Scalable Hierarchical Clustering with Tree Grafting

Monath

Kobren

Krishnamurthy

et al. 2019

Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining


We introduce Grinch, a new algorithm for large-scale, non-greedy hierarchical clustering with general linkage functions that compute arbitrary similarity between two point sets. The key components of Grinch are its rotate and graft subroutines that efficiently reconfigure the hierarchy as new points arrive, supporting discovery of clusters with complex structure. Grinch is motivated by a new notion of separability for clustering with linkage functions: we prove that when the model is consistent with a ground-truth clustering, Grinch is guaranteed to produce a cluster tree containing the ground-truth, independent of data arrival order. Our empirical results on benchmark and author coreference datasets (with standard and learned linkage functions) show that Grinch is more accurate than other scalable methods, and orders of magnitude faster than hierarchical agglomerative clustering. * The first two authors contributed equally.


“…Source code is at: https://github.com/tyler-hayes/ExStream. 2) Online k-means: This is a partitioning-based heuristic for an online variant of the traditional k-means clustering algorithm [49]. This heuristic is sometimes referred to as Learning Vector Quantization [43].…”

Section: Memory Efficient Rehearsal

mentioning

confidence: 99%

Memory Efficient Experience Replay for Streaming Learning

Hayes

Cahill

Kanan

2019

2019 International Conference on Robotics and Automation (ICRA)


In supervised machine learning, an agent is typically trained once and then deployed. While this works well for static settings, robots often operate in changing environments and must quickly learn new things from data streams. In this paradigm, known as streaming learning, a learner is trained online, in a single pass, from a data stream that cannot be assumed to be independent and identically distributed (iid). Streaming learning will cause conventional deep neural networks (DNNs) to fail for two reasons: 1) they need multiple passes through the entire dataset; and 2) non-iid data will cause catastrophic forgetting. An old fix to both of these issues is rehearsal. To learn a new example, rehearsal mixes it with previous examples, and then this mixture is used to update the DNN. Full rehearsal is slow and memory intensive because it stores all previously observed examples, and its effectiveness for preventing catastrophic forgetting has not been studied in modern DNNs. Here, we describe the ExStream algorithm for memory efficient rehearsal and compare it to alternatives. We find that full rehearsal can eliminate catastrophic forgetting in a variety of streaming learning settings, with ExStream performing well using far less memory and computation.
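The citation statement for this entry describes online k-means as a partitioning-based heuristic but does not spell out the update. As a minimal sketch, assuming the common sequential update in which each arriving point pulls its nearest center toward it by a step of 1/count, the idea might look as follows. The class name `OnlineKMeans`, the helper `squared_distance`, and the choice to seed the centers with the first k points are illustrative assumptions, not code from ExStream or from the cited papers.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class OnlineKMeans:
    """Hypothetical single-pass k-means: one center update per arriving point."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.centers: List[List[float]] = []
        self.counts: List[int] = []

    def update(self, point: Sequence[float]) -> int:
        """Consume one point and return the index of the center it joined."""
        p = [float(x) for x in point]

        # Assumption: the first k points become the initial centers.
        if len(self.centers) < self.k:
            self.centers.append(p)
            self.counts.append(1)
            return len(self.centers) - 1

        # Find the nearest existing center.
        j = min(range(self.k), key=lambda i: squared_distance(self.centers[i], p))

        # Move that center toward the point; a 1/count step keeps the center
        # equal to the running mean of the points assigned to it so far.
        self.counts[j] += 1
        rate = 1.0 / self.counts[j]
        self.centers[j] = [c + rate * (x - c) for c, x in zip(self.centers[j], p)]
        return j


model = OnlineKMeans(k=2)
labels = [model.update(p) for p in [[0, 0], [10, 10], [2, 0], [10, 12]]]
```

Each point is seen once and then discarded, so memory stays proportional to k rather than to the length of the stream.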


“…A natural approach is to use stochastic gradient methods to optimize the K-means cost [9,34]. Liberty et al [25] design an alternative online K-means algorithm that when processing a point, opts to start a new cluster if the point is far from the current centers. This idea draws inspiration from the algorithm of Charikar et al [11] for the online k-center problem, which also adjusts the current centers when a new point is far away.…”

Section: Related Work

mentioning

confidence: 99%

A Hierarchical Algorithm for Extreme Clustering

Kobren

Monath

Krishnamurthy

et al. 2017

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining


Many modern clustering methods scale well to a large number of data items, N, but not to a large number of clusters, K. This paper introduces PERCH, a new non-greedy algorithm for online hierarchical clustering that scales to both massive N and K, a problem setting we term extreme clustering. Our algorithm efficiently routes new data points to the leaves of an incrementally-built tree. Motivated by the desire for both accuracy and speed, our approach performs tree rotations for the sake of enhancing subtree purity and encouraging balancedness. We prove that, under a natural separability assumption, our non-greedy algorithm will produce trees with perfect dendrogram purity regardless of online data arrival order. Our experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher quality clustering than the strongest flat clustering competitor in nearly half the time. * The first two authors contributed equally.
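The citation statement for this entry says that the algorithm of Liberty et al. starts a new cluster when an arriving point is far from the current centers, but it does not say how "far" is decided. The sketch below illustrates only that general idea, assuming a fixed squared-distance cutoff; the function `assign_or_open`, the helper `sq_dist`, and the `threshold` parameter are hypothetical and should not be read as the rule used in the paper.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_or_open(
    centers: List[List[float]], point: Sequence[float], threshold: float
) -> int:
    """Assign `point` to its nearest center, or open a new cluster if it is far.

    Assumption: "far" means the squared distance to the nearest center
    exceeds a fixed `threshold`. Centers are not moved in this sketch.
    """
    p = [float(x) for x in point]
    if centers:
        best = min(range(len(centers)), key=lambda i: sq_dist(centers[i], p))
        if sq_dist(centers[best], p) <= threshold:
            return best

    # No center is close enough (or none exists yet): the point itself
    # becomes the center of a new cluster.
    centers.append(p)
    return len(centers) - 1


centers: List[List[float]] = []
stream = [[0, 0], [1, 0], [10, 10], [11, 10], [0, 1]]
labels = [assign_or_open(centers, p, threshold=4.0) for p in stream]
```

With a fixed cutoff the number of clusters depends entirely on the data and the chosen threshold, which is why this is only an illustration of the "open a cluster when far" idea rather than a method with a bounded cluster count.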

