We study fair clustering problems as proposed by Chierichetti et al. [CKLV17]. Here, points have a sensitive attribute, and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable [IMMM14] and show how to compute them in a streaming setting. Furthermore, we propose a variant of Lloyd's algorithm that computes fair clusterings and extend it to a fair k-means++ clustering algorithm. We implement these algorithms and provide empirical evidence that the combination of our approximation algorithms and the coreset construction yields a scalable algorithm for fair k-means clustering.
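For intuition, here is a minimal sketch (our own illustration, restricted to two protected groups; all names are ours) of the balance notion of Chierichetti et al. and the k-means cost that a fair clustering trades off against it:

import numpy as np

def balance(labels, colors, k):
    # Balance in the sense of Chierichetti et al. for two groups (colors 0/1):
    # the minimum over clusters of min(#group0/#group1, #group1/#group0).
    # 1.0 means every cluster is perfectly balanced; 0.0 means some cluster
    # contains only one group.
    b = 1.0
    for c in range(k):
        g0 = np.sum((labels == c) & (colors == 0))
        g1 = np.sum((labels == c) & (colors == 1))
        if g0 == 0 or g1 == 0:
            return 0.0
        b = min(b, g0 / g1, g1 / g0)
    return b

def kmeans_cost(points, labels, centers):
    # Sum of squared distances of every point to its assigned center.
    return float(np.sum((points - centers[labels]) ** 2))

A fair solution keeps balance(...) at or near the global group ratio while keeping kmeans_cost(...) low.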
This paper presents an algorithm for estimating the weight of a maximum weighted matching by augmenting any estimation routine for the size of an unweighted matching. The algorithm is implementable in any streaming model, including dynamic graph streams. We also give the first constant-factor estimation of the maximum matching size in a dynamic graph stream for planar graphs (or any graph with bounded arboricity) using Õ(n^{4/5}) space, which also extends to weighted matching. Using previous results by Kapralov, Khanna, and Sudan (2014), we obtain a polylog(n) approximation for general graphs using polylog(n) space in random-order streams. In addition, we give a space lower bound of Ω(n^{1−ε}) for any randomized algorithm estimating the size of a maximum matching up to a (1 + O(ε)) factor in adversarial streams.
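As an illustration of the augmentation idea (a sketch under our own simplifications; the function names, the geometric bucketing, and the exact combination rule are ours and not necessarily those of the paper), one standard reduction feeds geometric weight classes of the edge set to a black-box unweighted matching-size estimator and sums the scaled estimates:

import math
from collections import defaultdict

def estimate_weighted_matching(edges, estimate_matching_size, eps=0.5):
    # edges: iterable of (u, v, w) with weight w >= 1.
    # estimate_matching_size: any routine estimating the maximum matching
    # size of an unweighted edge list (e.g., a streaming estimator).
    by_level = defaultdict(list)
    for u, v, w in edges:
        by_level[int(math.log(w, 1 + eps))].append((u, v))  # weight class of w
    total, prefix = 0.0, []
    for i in sorted(by_level, reverse=True):
        prefix.extend(by_level[i])  # all edges of weight >= (1 + eps)^i seen so far
        total += eps * (1 + eps) ** i * estimate_matching_size(list(prefix))
    return total

Any of the size estimators mentioned in the abstract (for instance, the Õ(n^{4/5})-space estimator for bounded-arboricity graphs) could be plugged in as estimate_matching_size.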
In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic k-median and k-means problems, there is no known deterministic dimensionality reduction procedure or coreset construction that avoids an exponential dependency on the input dimension d, the precision parameter ε^{-1}, or k. Furthermore, there is no coreset construction that succeeds with probability 1 − 1/n and whose size does not depend on the number of input points, n. This has led researchers in the area to ask what the power of randomness is for clustering sketches [Feldman, WIREs Data Mining Knowl. Discov. '20]. Similarly, the best approximation ratios achievable deterministically without a complexity exponential in the dimension are 1 + √2 for k-median [Cohen-Addad, Esfandiari, Mirrokni, Narayanan, STOC'22] and 6.12903 for k-means [Grandoni, Ostrovsky, Rabani, Schulman, Venkat, Inf. Process. Lett. '22]. These are the best results even when allowing a complexity FPT in the number of clusters k: this stands in sharp contrast with the (1 + ε)-approximation achievable in that case when allowing randomization. In this paper, we provide deterministic sketch constructions for clustering whose size bounds are close to the best known randomized ones. We show how to compute a dimension reduction onto ε^{-O(1)} log k dimensions in time k^{O(ε^{-O(1)} + log log k)} poly(nd), and how to build a coreset of size O(k^2 log^3 k · ε^{-O(1)}) in time 2^{ε^{O(1)} k log^3 k} + k^{O(ε^{-O(1)} + log log k)} poly(nd). In the case where k is small, this answers an open question of [Feldman, WIDM'20] and [Munteanu and Schwiegelshohn, Künstliche Intell. '18] on whether it is possible to efficiently compute coresets deterministically. We also construct a deterministic algorithm for computing a (1 + ε)-approximation to k-median and k-means in high-dimensional Euclidean spaces in time 2^{k^2 log^3 k / ε^{O(1)}} poly(nd), close to the best randomized complexity of 2^{(k/ε)^{O(1)}} nd (see [Kumar, Sabharwal, Sen, JACM'10] and [Bhattacharya, Jaiswal, Kumar, TCS'18]). Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling and that immediately improves over the recent results of [Braverman et al., FOCS'22] by a factor of k.
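For reference, the coreset guarantee that the size and time bounds above refer to is the standard one; stated in our own notation for k-means, a weighted point set S with weights w is an ε-coreset for an input set P ⊂ R^d if, for every set C of k centers,

$$(1-\varepsilon)\sum_{p\in P}\min_{c\in C}\lVert p-c\rVert^2 \;\le\; \sum_{s\in S} w(s)\,\min_{c\in C}\lVert s-c\rVert^2 \;\le\; (1+\varepsilon)\sum_{p\in P}\min_{c\in C}\lVert p-c\rVert^2,$$

and analogously with non-squared distances for k-median.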
We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time Award winning algorithm BIRCH [27] with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order, and there is limited storage space. BICO computes high-quality solutions in a time that is short in practice. First, BICO computes a summary S of the data with a provable quality guarantee: for every center set C, S has the same cost as P up to a (1 + ε)-factor, i.e., S is a coreset. Then, it runs k-means++ [5] on S. We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen [24]) and with approximation algorithms (StreamKM++ [2], StreamLS [16, 26]) with the best known quality guarantees. We achieve the same quality as the approximation algorithms mentioned with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.
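A minimal sketch of this two-stage pipeline (our own illustration: build_summary below is a uniform-sampling placeholder with weights, not BICO's BIRCH-like tree of weighted representatives, and sklearn's KMeans stands in for the k-means++ implementation used in the paper):

import numpy as np
from sklearn.cluster import KMeans

def build_summary(points, m, seed=0):
    # Placeholder summary: m uniformly sampled points, each carrying weight n/m.
    # The real BICO summary maintains weighted representatives whose cost matches
    # the full data up to a (1 + eps) factor for every center set.
    points = np.asarray(points)
    idx = np.random.default_rng(seed).choice(len(points), size=m, replace=False)
    return points[idx], np.full(m, len(points) / m)

def cluster_from_summary(summary_points, weights, k):
    # Stage two: weighted k-means with k-means++ seeding on the small summary.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10)
    km.fit(summary_points, sample_weight=weights)
    return km.cluster_centers_

The centers computed on the summary can then be evaluated on the full point set; the coreset guarantee is what makes that evaluation faithful up to a (1 + ε) factor.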
We present a technical survey of the state-of-the-art approaches to data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching, and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview of lower bounding techniques.