Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms

Munteanu, Alexander; Schwiegelshohn, Chris

doi:10.1007/s13218-017-0519-3

Cited by 50 publications

(43 citation statements)

References 75 publications

“…Coreset construction techniques (Bachem et al, 2017; Munteanu and Schwiegelshohn, 2018) seek to create a “summary” weighted sample of a dataset with the property that a model learned on this dataset approximates one learned on the complete dataset. Here too, the difference in objectives is that we focus on small models, ignore training data size, and are interested in outperforming a model learned on the complete data.…”

Section: Overview (mentioning)

confidence: 99%

Interpretability With Accurate Small Models

Ghose, Ravindran

2020

Front. Artif. Intell.

Models often need to be constrained to a certain size for them to be considered interpretable. For example, a decision tree of depth 5 is much easier to understand than one of depth 50. Limiting model size, however, often reduces accuracy. We suggest a practical technique that minimizes this trade-off between interpretability and classification accuracy. This enables an arbitrary learning algorithm to produce highly accurate small-sized models. Our technique identifies the training data distribution to learn from that leads to the highest accuracy for a model of a given size. We represent the training distribution as a combination of sampling schemes. Each scheme is defined by a parameterized probability mass function applied to the segmentation produced by a decision tree. An Infinite Mixture Model with Beta components is used to represent a combination of such schemes. The mixture model parameters are learned using Bayesian Optimization. Under simplistic assumptions, we would need to optimize for O(d) variables for a distribution over a d-dimensional input space, which is cumbersome for most real-world data. However, we show that our technique significantly reduces this number to a fixed set of eight variables at the cost of relatively cheap preprocessing. The proposed technique is flexible: it is model-agnostic, i.e., it may be applied to the learning algorithm for any model family, and it admits a general notion of model size. We demonstrate its effectiveness using multiple real-world datasets to construct decision trees, linear probability models and gradient boosted models with different sizes. We observe significant improvements in the F1-score in most instances, exceeding an improvement of 100% in some cases.

“…), which turns out in fact to be a strong requirement. The reader interested in an overview of coreset construction techniques is referred to the recent review [99].…”

Section: Definition (mentioning)

confidence: 99%

Approximating Spectral Clustering via Sampling: A Review

Tremblay, Loukas

2019

Unsupervised and Semi-Supervised Learning

Spectral clustering refers to a family of unsupervised learning algorithms that compute a spectral embedding of the original data based on the eigenvectors of a similarity graph. This non-linear transformation of the data is both the key to these algorithms' success and their Achilles heel: forming a graph and computing its dominant eigenvectors can indeed be computationally prohibitive when dealing with more than a few tens of thousands of points. In this paper, we review the principal research efforts aiming to reduce this computational cost. We focus on methods that come with a theoretical control on the clustering performance and incorporate some form of sampling in their operation. Such methods abound in the machine learning, numerical linear algebra, and graph signal processing literature and, amongst others, include Nyström-approximation, landmarks, coarsening, coresets, and compressive spectral clustering. We present the approximation guarantees available for each and discuss practical merits and limitations. Surprisingly, despite the breadth of the literature explored, we conclude that there is still a gap between theory and practice: the most scalable methods are only intuitively motivated or loosely controlled, whereas those that come with end-to-end guarantees rely on strong assumptions or enable a limited gain of computation time.

“…Meanwhile, research on data summarization has inspired a third approach: collecting data summaries. Data summaries, e.g., coresets, sketches, projections [18], [19], [20], are derived datasets that are much smaller than the original dataset, and can hence be transferred to a central location with a low communication overhead. This approach has been adopted in recent works, e.g., [6], [7], [8], [21].…”

Section: A Related Work (mentioning)

confidence: 99%

“…Because of the dependence on the cost function (Definition II.1), existing coreset construction algorithms are tailor-made for specific machine learning problems. Here we briefly summarize common approaches for coreset construction and representative algorithms, and refer to [18], [19] for detailed surveys.…”

Section: B Coreset Construction Algorithms (mentioning)

confidence: 99%

Robust Coreset Construction for Distributed Machine Learning

et al. 2019

2019 IEEE Global Communications Conference (GLOBECOM)

Motivated by the need of solving machine learning problems over distributed datasets, we explore the use of coreset to reduce the communication overhead. Coreset is a summary of the original dataset in the form of a small weighted set in the same sample space. Compared to other data summaries, coreset has the advantage that it can be used as a proxy of the original dataset, potentially for different applications. However, existing coreset construction algorithms are each tailor-made for a specific machine learning problem. Thus, to solve different machine learning problems, one has to collect coresets of different types, defeating the purpose of saving communication overhead. We resolve this dilemma by developing coreset construction algorithms based on k-means/median clustering, that give a guaranteed approximation for a broad range of machine learning problems with sufficiently continuous cost functions. Through evaluations on diverse datasets and machine learning problems, we verify the robust performance of the proposed algorithms.
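The abstract above describes a coreset as a small weighted set that stands in for the full dataset, built via k-means/median clustering. The following is only an illustrative sketch of that general idea, not the authors' algorithm: it assumes the simplest possible instantiation, in which plain Lloyd's k-means is run and each center is weighted by the number of points assigned to it. The function name `kmeans_coreset` and its parameters are hypothetical, and no approximation guarantee is claimed for this toy version.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))


def kmeans_coreset(points, k, iters=20, seed=0):
    """Hypothetical helper: summarize `points` as k weighted centers.

    Runs Lloyd's k-means, then weights each center by its cluster size,
    so the weights always sum to len(points).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from k distinct data indices
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Move each center to its cluster mean; keep it if the cluster is empty.
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    # Final assignment pass: the weight of a center is its cluster size.
    weights = [0] * k
    for p in points:
        weights[nearest(p, centers)] += 1
    return centers, weights


# Usage: 400 synthetic 2-D points summarized by 4 weighted centers.
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
data += [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(200)]
centers, weights = kmeans_coreset(data, k=4)
```

Only the `(centers, weights)` pair would need to be transferred in a distributed setting, which is where the communication saving mentioned in the abstract would come from under these assumptions.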
