Christian Sohler scite author profile

We prove that the sum of the squared Euclidean distances from the n rows of an n × d matrix A to any compact set that is spanned by k vectors in R^d can be approximated up to a (1 + ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε^2)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the first O(k/ε^2) right singular vectors (principal components) of A. A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^O(jk) for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that depends linearly or even exponentially on d, which makes them useless when d ∼ n. Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. These algorithms use update time per point and memory that is polynomial in log n and only linear in d. For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces coresets of size k
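
The merge-and-reduce approach mentioned in the abstract above can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `MergeReduceStream`, `reduce_placeholder`, and the `size` parameter are illustrative, and the reduce step is a plain weighted uniform subsample standing in for a real coreset construction, so it carries none of the guarantees stated in the abstract. The sketch only shows the bucket structure that keeps memory logarithmic in the number of summarized blocks.

```python
import random


def reduce_placeholder(points, size, rng):
    """Stand-in for a coreset construction (assumption, not the paper's method).

    points is a list of (weight, vector) pairs. A uniform subsample is drawn
    and its weights are rescaled so the total weight is preserved.
    """
    if len(points) <= size:
        return list(points)
    total = sum(w for w, _ in points)
    sample = rng.sample(points, size)
    scale = total / sum(w for w, _ in sample)
    return [(w * scale, x) for w, x in sample]


class MergeReduceStream:
    """Keeps at most one summary per level, like digits of a binary counter."""

    def __init__(self, size, seed=0):
        self.size = size
        self.rng = random.Random(seed)
        self.buffer = []   # raw points not yet summarized
        self.levels = []   # levels[i] is None or a summary of 2**i blocks

    def insert(self, x):
        self.buffer.append((1.0, x))
        if len(self.buffer) < self.size:
            return
        carry, self.buffer = self.buffer, []
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            # Merge two summaries of the same level, then reduce the union.
            carry = reduce_placeholder(self.levels[i] + carry, self.size, self.rng)
            self.levels[i] = None
            i += 1

    def summary(self):
        out = list(self.buffer)
        for level in self.levels:
            if level is not None:
                out.extend(level)
        return out


stream = MergeReduceStream(size=10)
for i in range(1000):
    stream.insert((float(i), float(i % 7)))
summary = stream.summary()
```

Because each level holds at most one summary, the stored point count is bounded by `size` times the number of levels plus the buffer, and summaries built on separate machines could be merged the same way.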

Clustering for metric and nonmetric distance measures

Ackermann

Blömer

Sohler

2010

ACM Trans. Algorithms

242

We study a generalization of the k-median problem with respect to an arbitrary dissimilarity measure D. Given a finite set P of size n, our goal is to find a set C of size k such that the sum of errors D(P, C) = ∑_{p ∈ P} min_{c ∈ C} D(p, c) is minimized. The main result in this article can be stated as follows: There exists a (1+ϵ)-approximation algorithm for the k-median problem with respect to D, if the 1-median problem can be approximated within a factor of (1+ϵ) by taking a random sample of constant size and solving the 1-median problem on the sample exactly. This algorithm requires time n · 2^O(mk log(mk/ϵ)), where m is a constant that depends only on ϵ and D. Using this characterization, we obtain the first linear time (1+ϵ)-approximation algorithms for the k-median problem in an arbitrary metric space with bounded doubling dimension, for the Kullback-Leibler divergence (relative entropy), for the Itakura-Saito divergence, for Mahalanobis distances, and for some special cases of Bregman divergences. Moreover, we obtain previously known results for the Euclidean k-median problem and the Euclidean k-means problem in a simplified manner. Our results are based on a new analysis of an algorithm of Kumar et al. [2004].
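
The objective D(P, C) and the sampling idea for the 1-median step can be sketched in a few lines of Python. This is an illustrative sketch only, under the assumption that D is the squared Euclidean distance (the Euclidean k-means case named in the abstract), for which the exact 1-median of a sample is its centroid. The function names and the `sample_size` parameter are hypothetical, and the sketch is not the algorithm analyzed in the article.

```python
import random


def sq_euclidean(p, c):
    """Squared Euclidean distance, used here as the dissimilarity D (assumption)."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def clustering_cost(points, centers, dist):
    """D(P, C): each point pays the dissimilarity to its closest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def centroid(points):
    """Exact 1-median for squared Euclidean distance: the coordinate-wise mean."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def sampled_one_median(points, sample_size, rng):
    """Solve the 1-median problem exactly on a uniform random sample."""
    sample = rng.sample(points, min(sample_size, len(points)))
    return centroid(sample)


P = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
cost = clustering_cost(P, [(0.0, 0.0), (10.0, 0.0)], sq_euclidean)  # 0 + 1 + 0
center = sampled_one_median(P, sample_size=2, rng=random.Random(0))
```

For other dissimilarity measures, `sq_euclidean` and `centroid` would have to be swapped for the matching measure and its exact 1-median solver.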

StreamKM++: A Clustering Algorithm for Data Streams

Ackermann

Lammersen

Märtens

et al. 2010

110

194

Counting triangles in data streams

et al. 2006

We present two space-bounded random sampling algorithms that compute an approximation of the number of triangles in an undirected graph given as a stream of edges. Our first algorithm does not make any assumptions on the order of edges in the stream. It uses space that is inversely related to the ratio between the number of triangles and the number of triples with at least one edge in the induced subgraph, and constant expected update time per edge. Our second algorithm is designed for incidence streams (all edges incident to the same vertex appear consecutively). It uses space that is inversely related to the ratio between the number of triangles and length 2 paths in the graph and expected update time O(log |V| · (1 + s · |V|/|E|)), where s is the space requirement of the algorithm. These results significantly improve over previous work [20,8]. Since the space complexity depends only on the structure of the input graph and not on the number of nodes, our algorithms scale very well with increasing graph size and so they provide a basic tool to analyze the structure of large graphs. They have many applications, for example, in the discovery of Web communities, the computation of clustering and transitivity coefficient, and discovery of frequent patterns in large graphs. We have implemented both algorithms and evaluated their performance on networks from different application domains. The sizes of the considered graphs varied from about 8,000 nodes and 40,000 edges to 135 million nodes and more than 1 billion edges. For both algorithms we ran experiments with parameter s = 1,000, 10,000, 100,000, and 1,000,000 to evaluate running time and approximation guarantee. Both algorithms appear to be time efficient for these sample sizes. The approximation quality of the first algorithm varied significantly, and even for s = 1,000,000 we had more than 10% deviation for more than half of the instances. The second algorithm performed much better, and even for s = 10,000 we had an average deviation of less than 6% (taken over all but the largest instance, for which we could not compute the number of triangles exactly).

* This work was partially supported by the EU within the 6th Framework Programme under contract 001907 "Dynamically Evolving, Large Scale Information Systems" (DELIS).
* Part of this work was done while the author was a postdoc at Università degli Studi di Roma "La Sapienza".
† Part of this work was done while the author was visiting the School of Computer Science at Carnegie Mellon University.
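
The sampling idea behind the first algorithm can be illustrated with a simplified Python sketch. It assumes the whole edge list of a simple undirected graph fits in memory, so it is not the one-pass, space-bounded algorithm from the paper; the function name `estimate_triangles` and its parameters are hypothetical. Each sample draws a uniform edge (a, b) and a uniform third vertex v, and checks whether the triple is a triangle. A triangle is hit by 3 of the |E| · (|V| − 2) possible (edge, vertex) pairs, which gives the scaling factor in the last line.

```python
import random


def estimate_triangles(edges, num_samples, seed=0):
    """Estimate the triangle count of a simple undirected graph by sampling.

    Assumption: edges has no duplicates or self-loops and fits in memory.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    vertices = sorted({v for e in edge_list for v in e})
    if len(vertices) < 3 or not edge_list:
        return 0.0
    hits = 0
    for _ in range(num_samples):
        a, b = rng.choice(edge_list)
        # Draw a third vertex uniformly from V \ {a, b} by rejection.
        v = rng.choice(vertices)
        while v == a or v == b:
            v = rng.choice(vertices)
        if frozenset((a, v)) in edge_set and frozenset((b, v)) in edge_set:
            hits += 1
    # P(hit) = 3 * T / (|E| * (|V| - 2)), so rescale the hit rate accordingly.
    return (hits / num_samples) * len(edge_list) * (len(vertices) - 2) / 3


# Complete graph on 5 vertices: every sampled triple is a triangle.
k5 = [(i, j) for i in range(5) for j in range(i + 1, 5)]
k5_estimate = estimate_triangles(k5, num_samples=200)

# A path has no triangles at all.
path_estimate = estimate_triangles([(0, 1), (1, 2), (2, 3)], num_samples=200)
```

The fewer (edge, vertex) pairs that close a triangle, the more samples are needed for a stable estimate, which matches the abstract's statement that the space requirement is inversely related to that ratio.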

A PTAS for k-means clustering based on weak coresets

2007
