A general bootstrap performance diagnostic

Kleiner, Ariel; Talwalkar, Ameet; Agarwal, Sameer; Stoica, Ion; Jordan, Michael I.

doi:10.1145/2487575.2487650

Cited by 17 publications

(30 citation statements)

References 19 publications


“…We call such a procedure a diagnostic. We are going to use a diagnostic recently developed by Kleiner et al [22], but first let us provide some intuition for how a diagnostic should work. We consider first an impractical ideal diagnostic.…”

Section: Diagnosis (mentioning)

confidence: 99%

“…Algorithm 1 is the diagnostic of Kleiner et al [22], specialized to query approximation. We provide it here for completeness.…”

Section: Appendix A, Diagnostic Algorithm (mentioning)

confidence: 99%

“…The past decade has seen several investigations of such diagnostic methods in the statistics literature [12,22]. Recently, Kleiner et al [22] designed a diagnostic algorithm to detect when bootstrap-based error estimates are unreliable.…”

Section: Introduction (mentioning)

confidence: 99%

“…Recently, Kleiner et al [22] designed a diagnostic algorithm to detect when bootstrap-based error estimates are unreliable. However, their algorithm was not designed with computational considerations in mind and involves tens of thousands of test query executions, making it prohibitively slow in practice.…”

Section: Introduction (mentioning)

confidence: 99%


Knowing when you're wrong

Agarwal, Milner, Kleiner, et al. 2014

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Self Cite

118


Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions. The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at interactive speeds. However, it has been widely observed that many applications can tolerate some degree of inaccuracy. This is especially true for exploratory queries on data, where users are satisfied with "close-enough" answers if they can come quickly. A popular technique for speeding up queries at the cost of accuracy is to execute each query on a sample of data, rather than the whole dataset. To ensure that the returned result is not too inaccurate, past work on approximate query processing has used statistical techniques to estimate "error bars" on returned results. However, existing work in the sampling-based approximate query processing (S-AQP) community has not validated whether these techniques actually generate accurate error bars for real query workloads. In fact, we find that error bar estimation often fails on real world production workloads. Fortunately, it is possible to quickly and accurately diagnose the failure of error estimation for a query. In this paper, we show that it is possible to implement a query approximation pipeline that produces approximate answers and reliable error bars at interactive speeds.
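As a rough illustration of the sampling idea described in this abstract, the sketch below answers an AVG-style query on a uniform random sample and attaches a closed-form error bar. It is not the paper's pipeline: the function name `approx_mean_with_error_bar`, the synthetic `population`, and the normal-approximation interval are all assumptions chosen for illustration, and a closed form like this only applies to simple aggregates such as means.

```python
import math
import random


def approx_mean_with_error_bar(population, sample_size, z=1.96, seed=0):
    """Estimate the mean of `population` from a uniform random sample.

    Returns (estimate, half_width). The half-width is z * s / sqrt(n),
    a normal-approximation error bar (z=1.96 for roughly 95% coverage).
    This closed form is an assumption that suits means of well-behaved
    data; it is not valid for arbitrary queries.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # sample without replacement
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, half_width


# Hypothetical data: 100,000 values cycling through 0..99 (true mean 49.5).
population = [float(i % 100) for i in range(100_000)]
estimate, half_width = approx_mean_with_error_bar(population, 1_000)
print(f"approximate AVG = {estimate:.2f} +/- {half_width:.2f}")
```

The query touches 1% of the rows here, which is where the speedup comes from; the abstract's point is that error bars of this kind need to be validated rather than trusted blindly.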


“…This derivation is a manual process that is ad-hoc and often impractical for complex queries [34]. To address this problem, a second approach, called bootstrap, has emerged as a more general method for routine estimation of errors [27,30,34]. Bootstrap [19] is essentially a Monte Carlo procedure, which for a given initial sample, (i) repeatedly forms simulated datasets by resampling tuples i.i.d.…”

Section: Introduction (mentioning)

confidence: 99%

The analytical bootstrap

Zeng, Shi, Mozafari, et al. 2014

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data


Sampling is one of the most commonly used techniques in Approximate Query Processing (AQP), an area of research that is now made more critical by the need for timely and cost-effective analytics over "Big Data". Assessing the quality (i.e., estimating the error) of approximate answers is essential for meaningful AQP, and the two main approaches used in the past to address this problem are based on either (i) analytic error quantification or (ii) the bootstrap method. The first approach is extremely efficient but lacks generality, whereas the second is quite general but suffers from its high computational overhead. In this paper, we introduce a probabilistic relational model for the bootstrap process, along with rigorous semantics and a unified error model, which bridges the gap between these two traditional approaches. Based on our probabilistic framework, we develop efficient algorithms to predict the distribution of the approximation results. These enable the computation of any bootstrap-based quality measure for a large class of SQL queries via a single-round evaluation of a slightly modified query. Extensive experiments on both synthetic and real-world datasets show that our method has superior prediction accuracy for bootstrap-based quality measures, and is several orders of magnitude faster than bootstrap.
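The citation statement above describes the bootstrap as a Monte Carlo procedure that repeatedly resamples tuples i.i.d. from an initial sample. A minimal sketch of that textbook procedure follows, compared against the analytic standard error of a mean. It is not the method of any paper listed here; `bootstrap_standard_error`, the resample count, and the synthetic Gaussian data are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_standard_error(sample, estimator, num_resamples=200, seed=0):
    """Estimate the standard error of `estimator` via the bootstrap.

    Each round draws len(sample) items i.i.d. with replacement from the
    sample, re-evaluates the estimator, and the spread of those values
    approximates the estimator's sampling variability.
    """
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(num_resamples):
        resample = rng.choices(sample, k=n)  # i.i.d. draw with replacement
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


# Hypothetical initial sample: 500 draws from a normal distribution.
data_rng = random.Random(1)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(500)]

se_boot = bootstrap_standard_error(sample, statistics.mean)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5  # analytic SE of a mean
print(f"bootstrap SE = {se_boot:.4f}, analytic SE = {se_formula:.4f}")
```

For a mean the two numbers should be close, which is the easy case. The generality the abstract refers to comes from being able to pass any estimator in place of `statistics.mean`, at the cost of re-evaluating it once per resample.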


Skew‐aware online aggregation over joins through guided sampling

Wang, Jin, et al. 2018

Concurrency and Computation


Online aggregation is a query processing technique that returns approximate answers with error guarantees (in the form of confidence intervals) continuously during the query execution process. This approach offers users a suitable tradeoff between query efficiency and accuracy. The key issue of online aggregation is how to ensure a random sample collection's efficiency and effectiveness. However, the often-used "blind" sampling method does not adequately consider dataset statistics and other useful information, leading to inefficient sampling and poor sample quality. This becomes a glaring performance issue for skewed data distribution over joins. To alleviate this problem, we utilize dataset statistics to propose a new "guided" sampling approach, which consists of a logic-partition-based weighted Gaussian sampling method tailored for the skewed join key, as well as a two-level sample allocation method that applies to the skewed measured value. Extensive experiments using the TPC-H benchmark for skewed data distribution demonstrate our solution's superior performance.
