Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data 2020
DOI: 10.1145/3318464.3389696
Complaint-driven Training Data Debugging for Query 2.0

Cited by 33 publications (12 citation statements)
References 39 publications
“…Influence functions are a classic technique from robust statistics that measure how optimal model parameters depend on training data instances. Based on first-order influence functions, [95] introduced an approach for identifying training data points that are responsible for user constraints specified by an SQL query. Other methods exist as well: [98] develops an approach for identifying and ranking training data points based on their influence on the predictions of neural networks, and [96] develops an approach for incrementally computing the influence of removing a subset of training data points. Furthermore, other recent work argues for the use of data Shapley values to quantify the contribution of individual data instances [33,34,54]; these approaches are computationally expensive because each data instance requires the model to be retrained.…”
Section: Related Work
confidence: 99%
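The first-order influence functions mentioned above can be sketched concretely. The following is a minimal illustration (not the cited papers' implementations) for logistic regression: the influence of upweighting a training point z on the loss at a test point is approximated as -∇L(z_test)ᵀ H⁻¹ ∇L(z), where H is the Hessian of the total training loss at the optimum. All function names here are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_loss(theta, x, y):
    # Gradient of the logistic loss for a single training example.
    return (sigmoid(x @ theta) - y) * x

def hessian(theta, X, reg=1e-3):
    # Hessian of the (L2-regularized) total logistic loss.
    p = sigmoid(X @ theta)
    W = p * (1 - p)
    return (X * W[:, None]).T @ X + reg * np.eye(X.shape[1])

def influence(theta, X, y, x_test, y_test):
    # First-order influence of each training point on the test loss:
    #   I(z_i, z_test) = -grad(z_test)^T H^{-1} grad(z_i)
    H_inv = np.linalg.inv(hessian(theta, X))
    g_test = grad_loss(theta, x_test, y_test)
    return np.array([-g_test @ H_inv @ grad_loss(theta, X[i], y[i])
                     for i in range(len(y))])
```

Ranking training records by this score surfaces the points most responsible for a given prediction or query constraint, which is the primitive the complaint-driven debugging approaches above build on.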
“…Although they can generate a predicate explanation over the inference data, the provenance analysis does not extend across model training or UDFs, which are prevalent in data science workflows. The recent system Rain [45] generates fine-grained explanations for inference queries. It relaxes the inference query into a differentiable function over the model's prediction probabilities and leverages influence analysis [22] to estimate the query result's sensitivity to a training record.…”
Section: SQL Explain
confidence: 99%
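The query relaxation the excerpt attributes to Rain can be sketched as follows: a SQL aggregate over hard predictions, such as `SELECT COUNT(*) FROM T WHERE predict(T.x) = 1`, is not differentiable, but replacing the hard predicate with the model's class-1 probability yields a smooth surrogate whose gradient with respect to model parameters can feed influence analysis. This is a toy illustration, not Rain's actual code.

```python
import numpy as np

def hard_count(probs, threshold=0.5):
    # Original query semantics: count rows predicted positive.
    return int(np.sum(probs >= threshold))

def relaxed_count(probs):
    # Differentiable surrogate: the expected count under the model.
    return float(np.sum(probs))

probs = np.array([0.9, 0.8, 0.2, 0.6, 0.1])
print(hard_count(probs))     # 3
print(relaxed_count(probs))  # 2.6
```

The surrogate tracks the true count when the model is confident, and, unlike the hard count, changes smoothly as training-data perturbations shift the prediction probabilities.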
“…However, Rain returns training records rather than predicates, and estimating the model's prediction sensitivity to group-wise changes to the training data remains an open problem. Further, Rain does not currently support UDFs and uses a white-box approach that is less amenable to data science programs (Figure 1). As a first approach toward addressing the above limitations, and to diverge from existing white-box explanation approaches [1,33,34,44,45], this paper explores a black-box approach to inference query explanation. BOExplain models inference query explanation as a hyperparameter tuning problem and adopts Bayesian Optimization (BO) to solve it.…”
Section: SQL Explain
confidence: 99%
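The hyperparameter-tuning framing above can be sketched with a toy example (all data and names hypothetical, and random search standing in for the Bayesian optimizer): the bounds of a candidate predicate such as `age BETWEEN lo AND hi` are treated as tunable parameters, and each candidate is scored by filtering out the matching rows and measuring how much the user's complaint improves.

```python
import random

# Toy table of (age, predicted_label) rows; the hypothetical complaint is
# that the positive count is too high.
data = [(25, 1), (37, 1), (41, 0), (52, 1), (63, 0), (48, 1)]

def complaint_score(lo, hi):
    # Score a predicate by the positive count remaining after removing
    # rows with lo <= age <= hi (lower = better explanation).
    return sum(y for a, y in data if not (lo <= a <= hi))

# Black-box search over predicate bounds (random search as a stand-in
# for the Bayesian optimizer).
random.seed(0)
best_pred, best_val = None, float("inf")
for _ in range(100):
    lo = random.randint(20, 60)
    hi = lo + random.randint(1, 30)
    v = complaint_score(lo, hi)
    if v < best_val:
        best_pred, best_val = (lo, hi), v
print(best_pred, best_val)
```

The key property is that the objective only requires re-running the pipeline on filtered data, so the search treats the model and any UDFs as an opaque black box.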
“…Snorkel [59], ZeroER [77], TFX [6,51], "Query 2.0" [78], Krypton [54], Cerebro [55], ModelDB [72], MLFlow [80], DeepDive [81], HoloClean [60], ActiveClean [40], and NorthStar [39]. End-to-end AutoML systems [26,79,82] are an emerging class of systems that have significantly raised the level of abstraction for building ML applications.…”
Section: Introduction
confidence: 99%