Core‐sets: An updated survey

Feldman, Dan

doi:10.1002/widm.1335

Cited by 39 publications

(27 citation statements)

References 91 publications

“…The primary application of coresets is to create a compact representation of a large dataset, to allow for fast inference on downstream tasks (see [28] for a recent survey). However, such compact representations have also proved beneficial in interpretation of both models and datasets.…”

Section: Coresets For Understanding Datasets and Models (mentioning)

confidence: 99%

Understanding Collections of Related Datasets Using Dependent MMD Coresets

Williamson

Henderson

2021

Information

Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily compared across datasets. In this paper, we introduce dependent MMD coresets, a data summarization method for collections of datasets that facilitates comparison of distributions. We show that dependent MMD coresets are useful for understanding multiple related datasets and understanding model generalization between such datasets.

“…A different approach that is to use data summarization techniques. Coresets in particular were first used to solve problems in computational geometry [1] and got increasing attention in both the industry [3,4,5,17,35] and academy [6,8,23,24] over the recent years; see surveys in [20,44,47]. Informally, coreset is a small weighted subset of the input points (unlike e.g.…”

Section: Modern Machine Learning (mentioning)

confidence: 99%

“…The size of the coreset is usually polynomial in 1/ε but independent or near-logarithmic in the size of the input. Since such a coreset approximates every query (and not just the optimal one), it supports constraint optimization, and the above computation models using merge-and-reduce trees; see details in [20]. Moreover, coresets may be computed in time that is near-linear in the input, even for NP-hard optimization problems.…”

Section: Modern Machine Learning (mentioning)

confidence: 99%

Coresets for Near-Convex Functions

Tukan

Maalouf

Feldman

2020

Preprint

Self Cite

Coreset is usually a small weighted subset of n input points in R^d, that provably approximates their loss function for a given set of queries (models, classifiers, etc.). Coresets become increasingly common in machine learning since existing heuristics or inefficient algorithms may be improved by running them possibly many times on the small coreset that can be maintained for streaming distributed data. Coresets can be obtained by sensitivity (importance) sampling, where its size is proportional to the total sum of sensitivities. Unfortunately, computing the sensitivity of each point is problem dependent and may be harder to compute than the original optimization problem at hand. We suggest a generic framework for computing sensitivities (and thus coresets) for wide family of loss functions which we call near-convex functions. This is by suggesting the f-SVD factorization that generalizes the SVD factorization of matrices to functions. Example applications include coresets that are either new or significantly improves previous results, such as SVM, Logistic regression, M-estimators, and z-regression. Experimental results and open source are also provided. Preprint. Under review.
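The sensitivity (importance) sampling step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the per-point sensitivity bounds are already given; computing them is the problem-dependent part that the paper addresses and is not shown here. The function name `sensitivity_sample` and its parameters are hypothetical and are not taken from the paper or its code.

```python
import random


def sensitivity_sample(points, sensitivities, m, seed=0):
    """Draw m points i.i.d. with probability proportional to sensitivity.

    Hypothetical helper for illustration. Each sampled point gets weight
    1 / (m * p_i), so the weighted sum of losses over the sample is an
    unbiased estimate of the sum of losses over all input points.
    """
    total = sum(sensitivities)
    probs = [s / total for s in sensitivities]
    rng = random.Random(seed)
    indices = rng.choices(range(len(points)), weights=probs, k=m)
    return [(points[i], 1.0 / (m * probs[i])) for i in indices]


# With uniform sensitivities this reduces to uniform sampling,
# and every sampled point receives weight n / m.
points = [0.0, 1.0, 2.0, 10.0]
uniform = [1.0] * len(points)
coreset = sensitivity_sample(points, uniform, m=2)
```

In this sketch the sample size `m` is chosen by the caller; in the sensitivity framework it would be set in proportion to the total sensitivity, as the abstract states.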

“…While this overview covers the accurate coreset constructions in literature, as well as the required background for understanding their correctness and proofs, there are a small number of other, different but related, overviews and surveys. The most related recent survey we are aware of is (Feldman, 2020), which: (a) covers the main advantages and disadvantages of (both accurate and non‐accurate) coresets in general, (b) discusses the applications that coresets can make feasible (e.g., the streaming and distributed data models), (c) discusses the different coreset types, and (d) dives in detail into a general framework for (non‐accurate) coreset construction. However, it does not provide the background needed for understanding those techniques and algorithms, and does not discuss concrete coreset construction algorithms and their correctness.…”

Section: Introduction (mentioning)

confidence: 99%

Overview of accurate coresets

Jubran

Maalouf

Feldman

2021

WIREs Data Min & Knowl

Self Cite

A coreset of an input set is its small summarization, such that solving a problem on the coreset as its input, provably yields the same result as solving the same problem on the original (full) set, for a given family of problems (models/classifiers/loss functions). Coresets have been suggested for many fundamental problems, for example, in machine/deep learning, computer vision, databases, and theoretical computer science. This introductory paper was written following requests regarding the many inconsistent coreset definitions, lack of source code, the required deep theoretical background from different fields, and the dense papers that make it hard for beginners to apply and develop coresets. The article provides folklore, classic, and simple results including step‐by‐step proofs and figures, for the simplest (accurate) coresets. Nevertheless, we did not find most of their constructions in the literature. Moreover, we expect that putting them together in a retrospective context would help the reader to grasp current results that usually generalize these fundamental observations. Experts might appreciate the unified notation and comparison table for existing results. Open source code is provided for all presented algorithms, to demonstrate their usage, and to support the readers who are more familiar with programming than mathematics. This article is categorized under: Algorithmic Development > Structure Discovery; Fundamental Concepts of Data and Knowledge > Big Data Mining; Technologies > Machine Learning.
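As a small illustration of what "provably yields the same result" can mean, the sketch below uses the textbook identity sum_p ||p - q||^2 = sum_p ||p||^2 - 2 q·(sum_p p) + n ||q||^2 to evaluate the sum of squared distances to any query q exactly from three stored statistics. This is a summary of statistics rather than a weighted subset of the input, and it is not presented as a construction from the article; the function names are hypothetical.

```python
def build_summary(points):
    """Store n, the coordinate-wise sum, and the sum of squared norms."""
    n = len(points)
    dim = len(points[0])
    vec_sum = [sum(p[j] for p in points) for j in range(dim)]
    sq_sum = sum(x * x for p in points for x in p)
    return n, vec_sum, sq_sum


def cost_from_summary(summary, q):
    """Sum of squared distances to q, computed from the summary alone."""
    n, vec_sum, sq_sum = summary
    q_sq = sum(x * x for x in q)
    dot = sum(a * b for a, b in zip(vec_sum, q))
    return sq_sum - 2.0 * dot + n * q_sq


def full_cost(points, q):
    """Sum of squared distances to q, computed over all input points."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p in points)


points = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
query = (1.0, 1.0)
summary = build_summary(points)
print(cost_from_summary(summary, query), full_cost(points, query))  # 7.0 7.0
```

The summary has fixed size regardless of the number of input points, and two summaries can be merged by adding their entries, which is what makes this kind of exact summarization convenient for streaming or distributed data.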
