Andrea Vattani scite author profile

Scalable k-means++

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
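The k passes mentioned in the abstract come from the way sequential k-means++ picks its centers: each new center is sampled with probability proportional to its squared distance from the centers chosen so far, so every draw depends on the previous one. The sketch below illustrates that sequential seeding only; it is not the parallel k-means|| algorithm of the paper. The names `sq_dist` and `kmeanspp_seed` are hypothetical, and the code assumes the input contains at least k distinct points given as tuples.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Sequential k-means++ seeding (illustrative sketch, hypothetical name).

    Assumes `points` holds at least k distinct points; otherwise all
    weights can become zero and the sampling step raises ValueError.
    """
    if rng is None:
        rng = random.Random(0)
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # One pass over the data per new center: sample proportionally to d2.
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Update nearest-center distances with the newly added center.
        d2 = [min(d, sq_dist(p, points[idx])) for d, p in zip(d2, points)]
    return centers
```

Each iteration of the `while` loop touches every point once, which is why choosing k centers this way costs k passes over the data.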


Fast Greedy Algorithms in MapReduce and Streaming

Kumar

Moseley

Vassilvitskii

et al. 2015

ACM Trans. Parallel Comput.

150

171


Greedy algorithms are practitioners’ best friends: they are intuitive, are simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging since the greedy choice is inherently sequential, and it is not clear how to take advantage of the extra processing power. Our main result is a powerful sampling technique that aids in parallelization of sequential algorithms. Armed with this primitive, we then adapt a broad class of greedy algorithms to the MapReduce paradigm; this class includes maximum cover and submodular maximization subject to p-system constraint problems. Our method yields efficient algorithms that run in a logarithmic number of rounds while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm. We begin with algorithms for modular maximization subject to a matroid constraint and then extend this approach to obtain approximation algorithms for submodular maximization subject to knapsack or p-system constraints.
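For reference, the standard sequential greedy algorithm for maximum cover, one of the problems named in the abstract, can be sketched as follows. This is only the sequential baseline, not the sampling-based MapReduce method of the paper; the function name `greedy_max_cover` is hypothetical, and the input is assumed to be a list of Python sets.

```python
def greedy_max_cover(sets, k):
    """Pick up to k sets, each time taking the one that covers the most
    still-uncovered elements (sequential greedy sketch, hypothetical name)."""
    covered = set()
    chosen = []
    for _ in range(k):
        remaining = [i for i in range(len(sets)) if i not in chosen]
        # The greedy choice depends on everything picked so far,
        # which is what makes the algorithm inherently sequential.
        best = max(remaining, key=lambda i: len(sets[i] - covered), default=None)
        if best is None or not (sets[best] - covered):
            break  # nothing left that adds new coverage
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

For example, `greedy_max_cover([{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1}], 2)` first takes the four-element set and then the set `{1, 2, 3}`.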


k-means Requires Exponentially Many Iterations Even in the Plane

Vattani

2011

Discrete Comput Geom

172


k-means requires exponentially many iterations even in the plane

Vattani

2009


The k-means algorithm is a well-known method for partitioning n points that lie in d-dimensional space into k clusters. Its main features are simplicity and speed in practice. Theoretically, however, the best known upper bound on its running time (i.e. O(n^{kd})) can be exponential in the number of points. Recently, Arthur and Vassilvitskii [2] showed a super-polynomial worst-case analysis, improving the best known lower bound from Ω(n) to 2^{Ω(√n)} with a construction in d = Ω(√n) dimensions. In [2] they also conjectured the existence of super-polynomial lower bounds for any d ≥ 2. Our contribution is twofold: we prove this conjecture and we improve the lower bound, by presenting a simple construction in the plane that leads to the exponential lower bound 2^{Ω(n)}.
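The quantity bounded in this abstract is the number of iterations k-means performs before its centers stop changing. A minimal sketch of that iteration (often called Lloyd's algorithm) is shown below, returning the iteration count alongside the final centers. The names `sq_dist` and `lloyd` and the `max_iter` cap are assumptions made for illustration; the code does not reproduce the paper's lower-bound construction.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centers, max_iter=100):
    """Run k-means iterations until the centers stop moving.

    Returns (final_centers, iterations_used). Illustrative sketch only.
    """
    centers = [tuple(c) for c in centers]
    for it in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            return centers, it  # converged: no center moved
        centers = new_centers
    return centers, max_iter
```

On typical inputs the loop stops after a handful of iterations; the result above shows that carefully constructed planar inputs can force exponentially many.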


Hiring a secretary from a poset

Kumar

Lattanzi

Vassilvitskii

et al. 2011
