MP-Boost: Minipatch Boosting via Adaptive Feature and Observation Sampling

Toghani, Mohammad Taha; Allen, Genevera I.

doi:10.1109/bigcomp51126.2021.00023

Cited by 6 publications

(7 citation statements)

References 17 publications


“…We also use the FairStacks meta-learner to construct several weighted ensembles, shown as solid lines in Figure 2: i) random forest trees, i.e., the elements of a random forest ensemble; ii) an ensemble of 1000 minipatch decision trees, i.e. decision trees trained on small sub-samples of features and observations [42,38]; iii) learners implemented in the Scikit-Learn and XGBoost packages [35,7]; and iv) a kitchen-sink ensemble consisting of both the stand-alone methods and the previous ensembles.…”

Section: Empirical Studies (mentioning)

confidence: 99%

To the Fairness Frontier and Beyond: Identifying, Quantifying, and Optimizing the Fairness-Accuracy Pareto Frontier

Little, Weylandt, Allen

2022

Preprint


Algorithmic fairness has emerged as an important consideration when developing and deploying machine learning models to make high-stakes societal decisions. Yet, improved fairness often comes at the expense of model accuracy. While aspects of the fairness-accuracy tradeoff have been studied, most work reports the fairness and accuracy of various models separately; this makes model comparisons nearly impossible without a unified model-agnostic metric that reflects the Pareto optimal balance of the two desiderata. In this paper, we seek to identify, quantify, and optimize the empirical Pareto frontier of the fairness-accuracy tradeoff, defined as the highest attained accuracy at every level of fairness for a collection of fitted models. Specifically, we identify and outline the empirical Pareto frontier through our Tradeoff-between-Fairness-and-Accuracy (taf) Curves; we then develop a single metric to quantify this Pareto frontier through the weighted area under the taf Curve which we term the Fairness-Area-Under-the-Curve (fauc). Our taf Curves provide the first empirical, model-agnostic characterization of the Pareto frontier, while our fauc provides the first unified metric to impartially compare model families in terms of both fairness and accuracy. Both taf Curves and fauc are general and can be employed with all group fairness definitions and accuracy measures. Next, we ask: Is it possible to expand the empirical Pareto frontier and thus improve the fauc for a given collection of fitted models? We answer in the affirmative by developing a novel fair model stacking framework, FairStacks. FairStacks solves a convex program to maximize the accuracy of a linear combination of fitted models subject to a constraint on score-based model bias. We show that optimizing with FairStacks always expands the empirical Pareto frontier and improves the fauc; we additionally study other theoretical properties of our proposed approach. 
Finally, we empirically validate taf, fauc, and FairStacks through studies on several real benchmark data sets, showing that FairStacks leads to major improvements in fauc that outperform existing algorithmic fairness approaches.
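The abstract defines the empirical Pareto frontier as the highest attained accuracy at every level of fairness for a collection of fitted models. The following is a minimal sketch of that definition only, not the authors' taf Curve or fauc computation. It assumes each model is summarized by a hypothetical `(name, fairness, accuracy)` tuple in which higher is better for both scores; the model names and numbers are made up for illustration.

```python
from typing import List, Optional, Tuple

Model = Tuple[str, float, float]  # (name, fairness, accuracy); hypothetical layout


def empirical_pareto_frontier(models: List[Model]) -> List[Model]:
    """Return the models not dominated in both fairness and accuracy.

    Assumes higher fairness and higher accuracy are both better.
    """
    # Sweep from the fairest model down; a model is kept only if it
    # beats the accuracy of every model that is at least as fair.
    ordered = sorted(models, key=lambda m: (-m[1], -m[2]))
    frontier = []
    best_acc = float("-inf")
    for name, fairness, accuracy in ordered:
        if accuracy > best_acc:
            frontier.append((name, fairness, accuracy))
            best_acc = accuracy
    return frontier


def best_accuracy_at(models: List[Model], fairness_level: float) -> Optional[float]:
    """Highest accuracy among models at least as fair as `fairness_level`."""
    eligible = [acc for _, fair, acc in models if fair >= fairness_level]
    return max(eligible, default=None)


# Illustrative (made-up) scores for five fitted models.
models = [
    ("A", 0.9, 0.70),
    ("B", 0.8, 0.75),
    ("C", 0.8, 0.72),
    ("D", 0.6, 0.74),
    ("E", 0.5, 0.85),
]
frontier = empirical_pareto_frontier(models)
print([name for name, _, _ in frontier])
```

Models C and D are dropped here because B is at least as fair as each of them and more accurate.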



“…Inspired by the subsampling approach of J+aB, our inferential approach is rooted in an ensemble built by taking tiny random subsamples of both observations and features in tabular data. This idea of double subsampling appears first in the context of random forests [41,27], linear regression [37], and more recently has been termed "minipatch ensembles" by [65,80]. We adopt this idea of minipatch ensembles and we are the first to develop inferential approaches using this approach.…”

Section: Related Work (mentioning)

confidence: 99%

Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles

Gan, Zheng, Allen

2022

Preprint

Self Cite


In order to trust machine learning for high-stakes problems, we need models to be both reliable and interpretable. Recently, there has been a growing body of work on interpretable machine learning which generates human understandable insights into data, models, or predictions. At the same time, there has been increased interest in quantifying the reliability and uncertainty of machine learning predictions, often in the form of confidence intervals for predictions using conformal inference. Yet, there has been relatively little attention given to the reliability and uncertainty of machine learning interpretations, which is the focus of this paper. Our goal is to develop confidence intervals for a widely-used form of machine learning interpretation: feature importance. We specifically seek to develop universal model-agnostic and assumption-light confidence intervals for feature importance that will be valid for any machine learning model and for any regression or classification task. We do so by leveraging a form of random observation and feature subsampling called minipatch ensembles and show that our approach provides assumption-light asymptotic coverage for the feature importance score of any model. Further, our approach is fast as computations needed for inference come nearly for free as part of the ensemble learning process. Finally, we also show that our same procedure can be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both model predictions and interpretations. We validate our intervals on a series of synthetic and real data examples, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.
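The citation statements describe a minipatch ensemble as a set of learners, each trained on a tiny random subsample of both observations and features. A minimal sketch of that double subsampling follows. It uses uniform sampling and a one-feature threshold stump as a stand-in base learner, both of which are assumptions made for illustration; it is not MP-Boost's adaptive sampling scheme, and every function name and the toy data are hypothetical.

```python
import random
from collections import Counter


def sample_minipatch(n_obs, n_feat, m, k, rng):
    """Draw m row indices and k column indices uniformly without replacement."""
    return rng.sample(range(n_obs), m), rng.sample(range(n_feat), k)


def fit_stump(X, y, rows, cols):
    """Fit a one-feature threshold rule using only the minipatch's rows and columns."""
    best = (len(rows) + 1, cols[0], 0.0, 1)  # (errors, feature, threshold, label_above)
    for j in cols:
        for t in sorted({X[i][j] for i in rows}):
            for above in (0, 1):
                errors = sum(
                    (above if X[i][j] > t else 1 - above) != y[i] for i in rows
                )
                if errors < best[0]:
                    best = (errors, j, t, above)
    return best[1:]  # (feature, threshold, label_above)


def predict_stump(stump, x):
    j, t, above = stump
    return above if x[j] > t else 1 - above


def fit_minipatch_ensemble(X, y, n_learners, m, k, seed=0):
    """Train one stump per minipatch; each learner sees only m rows and k columns."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_learners):
        rows, cols = sample_minipatch(len(X), len(X[0]), m, k, rng)
        ensemble.append(fit_stump(X, y, rows, cols))
    return ensemble


def predict_ensemble(ensemble, x):
    """Majority vote over all minipatch learners."""
    votes = Counter(predict_stump(stump, x) for stump in ensemble)
    return votes.most_common(1)[0][0]


# Toy binary data: 12 observations, 4 features; only feature 0 drives the label.
X = [[i / 11, (i * 7 % 12) / 11, (i * 5 % 12) / 11, 0.5] for i in range(12)]
y = [1 if row[0] > 0.5 else 0 for row in X]

ensemble = fit_minipatch_ensemble(X, y, n_learners=25, m=4, k=2)
print(predict_ensemble(ensemble, X[0]), predict_ensemble(ensemble, X[-1]))
```

Because each learner touches only `m * k` entries of the data, the cost per learner stays small even when the full data set is large, which is the computational argument the citing papers make for minipatches.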


“…Specifically, we seek to improve computational efficiency, provide interpretability in terms of feature importance, and at the same time improve clustering accuracy. We achieve these goals by leveraging the idea of minipatch learning [Yao and Allen, 2020, Yao et al, 2021, Toghani and Allen, 2021] which is an ensemble of learners trained on tiny subsamples of both observations and features. Compared to only subsampling observations in existing consensus clustering ensembles, by learning on many tiny data sets, our approach offers dramatic computational savings.…”

Section: Contributions (mentioning)

confidence: 99%

“…While the core of our approach is identical to that of consensus clustering, we offer three major methodological innovations in Steps 1 and 2 of Algorithm 1 that yield dramatically faster, more accurate, and interpretable results. Our first innovation is building cluster ensembles based on tiny subsets (typically 10% or less) of both observations and features termed minipatches [Yao and Allen, 2020, Yao et al, 2021, Toghani and Allen, 2021]. Note that existing consensus clustering approaches form ensembles by subsampling typically 80% of observations and all the features for each ensemble member [Wilkerson and Hayes, 2010].…”

Section: Minipatch Consensus Clustering (mentioning)

confidence: 99%

Fast and Interpretable Consensus Clustering via Minipatch Learning

Gan, Allen

2021

Preprint

Self Cite


Consensus clustering has been widely used in bioinformatics and other applications to improve the accuracy, stability and reliability of clustering results. This approach ensembles cluster co-occurrences from multiple clustering runs on subsampled observations. For application to large-scale bioinformatics data, such as to discover cell types from single-cell sequencing data, for example, consensus clustering has two significant drawbacks: (i) computational inefficiency due to repeatedly applying clustering algorithms, and (ii) lack of interpretability into the important features for differentiating clusters. In this paper, we address these two challenges by developing IMPACC: Interpretable Mini-Patch Adaptive Consensus Clustering. Our approach adopts three major innovations. We ensemble cluster co-occurrences from tiny subsets of both observations and features, termed minipatches, thus dramatically reducing computation time. Additionally, we develop adaptive sampling schemes for observations, which result in both improved reliability and computational savings, as well as adaptive sampling schemes of features, which leads to interpretable solutions by quickly learning the most relevant features that differentiate clusters. We study our approach on synthetic data and a variety of real large-scale bioinformatics data sets; results show that our approach not only yields more accurate and interpretable cluster solutions, but it also substantially improves computational efficiency compared to standard consensus clustering approaches.

