2020
DOI: 10.3390/a13040092

Deterministic Coresets for k-Means of Big Sparse Data

Abstract: Let P be a set of n points in R^d, k ≥ 1 be an integer, and ε ∈ (0, 1) be a constant. An ε-coreset is a subset C ⊆ P with appropriate non-negative weights (scalars) that approximates any given set Q ⊆ R^d of k centers. That is, the sum of squared distances over every point in P to its closest point in Q is the same, up to a factor of 1 ± ε, as the weighted sum of distances from C to the same k centers. If the coreset is small, we can solve problems such as k-means clustering…
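The ε-coreset definition above can be evaluated numerically. The following sketch uses a plain uniform subsample with rescaled weights as a stand-in for C; this is not a guaranteed ε-coreset construction, and all function and variable names here are illustrative, not from the paper. It only shows how the weighted cost of C is compared against the full cost of P for one candidate center set Q:

```python
import numpy as np

def cost(points, weights, centers):
    """Weighted sum of squared distances from each point to its nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float((weights * d2.min(axis=1)).sum())

rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 2))   # the n input points in R^d
Q = rng.normal(size=(3, 2))      # one candidate set of k = 3 centers

# Uniform subsample with rescaled weights -- NOT a guaranteed eps-coreset,
# only a stand-in for C to show how the definition is evaluated.
idx = rng.choice(len(P), size=100, replace=False)
C, w = P[idx], np.full(100, len(P) / 100)

full = cost(P, np.ones(len(P)), Q)
rel_err = abs(cost(C, w, Q) - full) / full  # empirical 1 +/- eps gap for this Q
```

A true ε-coreset would bound `rel_err` by ε simultaneously for *every* choice of Q, which is what the deterministic construction in the paper guarantees.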

Cited by 6 publications (4 citation statements) · References 18 publications
“…[31] used a two-stage strategy for clustering, combining hierarchical and non-hierarchical clustering, and came to the same conclusion about obtaining better outcomes. [32] suggested using self-organizing maps (i.e., model-oriented) to evaluate the clusters produced by the k-means algorithm.…”
Section: Two-Stage Clustering and Data Size
confidence: 99%
“…However, these coresets are exponential in the dimension d of the input. Recently, a deterministic coreset of size independent of d was suggested (Barger & Feldman, 2020).…”
Section: Accurate Coresets
confidence: 99%
“…In particular, we can always improve a given k-clustering by replacing the center of each cluster with its mean (if this is not already the case). This is indeed the idea behind the classic Lloyd's heuristic [Llo82] and also behind some coresets for k-means [BF20]. Most coreset construction algorithms for these hard problems borrow or generalize tricks and techniques used in coreset constructions for the (simpler) mean problem.…”
Section: Introduction
confidence: 99%
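The improvement step described in the excerpt above (replace each cluster's center with its mean) can be sketched as a single Lloyd iteration; the function names here are illustrative, not from [Llo82] or [BF20]. Because the mean minimizes the within-cluster sum of squared distances, this step can never increase the k-means cost:

```python
import numpy as np

def kmeans_cost(P, centers):
    """Sum of squared distances from each point to its nearest center."""
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def improve_by_means(P, centers):
    """One Lloyd step: assign each point to its nearest center,
    then replace every center by the mean of its assigned cluster."""
    labels = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    new = centers.copy()
    for j in range(len(centers)):
        if (labels == j).any():
            new[j] = P[labels == j].mean(axis=0)
    return new

rng = np.random.default_rng(1)
P = rng.normal(size=(200, 2))
centers = rng.normal(size=(4, 2))
better = improve_by_means(P, centers)
# kmeans_cost(P, better) <= kmeans_cost(P, centers) always holds
```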
“…A coreset that introduces multiplicative 1 + ε error for this problem (SVD/linear regression) was suggested in [FVR16]; there, too, the authors suggested a reduction to the problem of computing a mean coreset with multiplicative 1 + ε error for a set of points in a higher-dimensional space. Another example is in the context of k-means, where [BF20] showed that to compute a k-means coreset for a set of points P, it suffices to cluster these points into a large number of clusters, compute a mean coreset for each cluster, and take the union of these coresets into a single set, which is proven to be a k-means coreset for P.…”
Section: Introduction
confidence: 99%
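The reduction attributed to [BF20] in the excerpt above — over-cluster P into many clusters, summarize each cluster, and take the union — can be sketched schematically. For brevity this sketch keeps only each cluster's weighted mean rather than a full mean coreset per cluster, so it illustrates the pipeline, not the provable construction; all names are our own:

```python
import numpy as np

def many_cluster_summary(P, m, iters=10, seed=0):
    """Schematic sketch: over-cluster P into m >> k clusters with a few
    Lloyd iterations, then summarize each cluster by its mean, weighted by
    the cluster size.  [BF20] replaces each cluster with a provable mean
    coreset instead of a single mean; this sketch only shows the pipeline."""
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), m, replace=False)]  # fancy indexing copies
    for _ in range(iters):
        labels = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(2).argmin(1)
        for j in range(m):
            if (labels == j).any():
                centers[j] = P[labels == j].mean(axis=0)
    # final assignment so weights match the returned centers
    labels = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(2).argmin(1)
    weights = np.bincount(labels, minlength=m).astype(float)
    keep = weights > 0  # drop empty clusters
    return centers[keep], weights[keep]

P = np.random.default_rng(2).normal(size=(500, 2))
C, w = many_cluster_summary(P, m=50)  # 50 weighted points summarizing 500
```

The weights sum to |P|, so the summary preserves total mass, which is what lets the union of per-cluster summaries stand in for P in the k-means cost.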