Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data 2020
DOI: 10.1145/3318464.3389696
Complaint-driven Training Data Debugging for Query 2.0

Cited by 33 publications (12 citation statements)
References 39 publications
“…Influence functions are a classic technique from robust statistics that measure how optimal model parameters depend on training data instances. Based on first-order influence functions, [95] introduced an approach for identifying training data points that are responsible for user constraints specified by an SQL query. Other methods exist as well: [98] develops an approach for identifying and ranking training data points based on their influence on the predictions of neural networks, and [96] develops an approach for incrementally computing the influence of removing a subset of training data points. Furthermore, other recent work argues for the use of data Shapley values to quantify the contribution of individual data instances [33,34,54]; these approaches are computationally expensive because each data instance requires the model to be retrained.…”
Section: Related Work
confidence: 99%
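The first-order influence functions mentioned above can be sketched concretely. The following is a minimal illustration (not the cited papers' implementations) for logistic regression: the influence of upweighting a training point z on the loss at a test point is approximated as -∇L(z_test)ᵀ H⁻¹ ∇L(z), where H is the Hessian of the total training loss at the optimum. All function names here are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_loss(theta, x, y):
    # Gradient of the logistic loss for a single training example.
    return (sigmoid(x @ theta) - y) * x

def hessian(theta, X, reg=1e-3):
    # Hessian of the (L2-regularized) total logistic loss.
    p = sigmoid(X @ theta)
    W = p * (1 - p)
    return (X * W[:, None]).T @ X + reg * np.eye(X.shape[1])

def influence(theta, X, y, x_test, y_test):
    # First-order influence of each training point on the test loss:
    #   I(z_i, z_test) = -grad(z_test)^T H^{-1} grad(z_i)
    H_inv = np.linalg.inv(hessian(theta, X))
    g_test = grad_loss(theta, x_test, y_test)
    return np.array([-g_test @ H_inv @ grad_loss(theta, X[i], y[i])
                     for i in range(len(y))])
```

Ranking training records by this score surfaces the points most responsible for a given prediction or query constraint, which is the primitive the complaint-driven debugging approaches above build on.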
“…Although they can generate a predicate explanation over the inference data, the provenance analysis does not extend across model training or UDFs, which are prevalent in data science workflows. The recent system Rain [45] generates fine-grained explanations for inference queries. It relaxes the inference query into a differentiable function over the model's prediction probabilities and leverages influence analysis [22] to estimate the query result's sensitivity to a training record.…”
Section: SQL Explain
confidence: 99%
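The query relaxation the excerpt attributes to Rain can be sketched as follows: a SQL aggregate over hard predictions, such as `SELECT COUNT(*) FROM T WHERE predict(T.x) = 1`, is not differentiable, but replacing the hard predicate with the model's class-1 probability yields a smooth surrogate whose gradient with respect to model parameters can feed influence analysis. This is a toy illustration, not Rain's actual code.

```python
import numpy as np

def hard_count(probs, threshold=0.5):
    # Original query semantics: count rows predicted positive.
    return int(np.sum(probs >= threshold))

def relaxed_count(probs):
    # Differentiable surrogate: the expected count under the model.
    return float(np.sum(probs))

probs = np.array([0.9, 0.8, 0.2, 0.6, 0.1])
print(hard_count(probs))     # 3
print(relaxed_count(probs))  # 2.6
```

The surrogate tracks the true count when the model is confident, and, unlike the hard count, changes smoothly as training-data perturbations shift the prediction probabilities.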
“…However, Rain returns training records rather than predicates, and estimating the model's prediction sensitivity to group-wise changes to the training data remains an open problem. Further, Rain does not currently support UDFs and uses a white-box approach that is less amenable to data science programs (Figure 1). As a first approach toward addressing the above limitations, and to diverge from existing white-box explanation approaches [1,33,34,44,45], this paper explores a black-box approach to inference query explanation. BOExplain models inference query explanation as a hyperparameter tuning problem and adopts Bayesian Optimization (BO) to solve it.…”
Section: SQL Explain
confidence: 99%
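The hyperparameter-tuning framing above can be sketched with a toy example (all data and names hypothetical, and random search standing in for the Bayesian optimizer): the bounds of a candidate predicate such as `age BETWEEN lo AND hi` are treated as tunable parameters, and each candidate is scored by filtering out the matching rows and measuring how much the user's complaint improves.

```python
import random

# Toy table of (age, predicted_label) rows; the hypothetical complaint is
# that the positive count is too high.
data = [(25, 1), (37, 1), (41, 0), (52, 1), (63, 0), (48, 1)]

def complaint_score(lo, hi):
    # Score a predicate by the positive count remaining after removing
    # rows with lo <= age <= hi (lower = better explanation).
    return sum(y for a, y in data if not (lo <= a <= hi))

# Black-box search over predicate bounds (random search as a stand-in
# for the Bayesian optimizer).
random.seed(0)
best_pred, best_val = None, float("inf")
for _ in range(100):
    lo = random.randint(20, 60)
    hi = lo + random.randint(1, 30)
    v = complaint_score(lo, hi)
    if v < best_val:
        best_pred, best_val = (lo, hi), v
print(best_pred, best_val)
```

The key property is that the objective only requires re-running the pipeline on filtered data, so the search treats the model and any UDFs as an opaque black box.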
“…Snorkel [59], ZeroER [77], TFX [6,51], "Query 2.0" [78], Krypton [54], Cerebro [55], ModelDB [72], MLFlow [80], DeepDive [81], HoloClean [60], ActiveClean [40], and NorthStar [39]. End-to-end AutoML systems [26,79,82] are an emerging class of systems that have significantly raised the level of abstraction for building ML applications.…”
Section: Introduction
confidence: 99%