2011
DOI: 10.1007/978-3-642-23783-6_42

The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling

Abstract: We develop a novel method, based on the statistical concept of VC-dimension, for evaluating the selectivity (output cardinality) of SQL queries, a crucial step in optimizing the execution of large-scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound on the VC-dimension of a range space defined by all possible outcomes of a collection (class) of queries. We prove that the VC-dimension is a function of the max…

Cited by 21 publications (15 citation statements)
References 66 publications
“…For instance, super-level sets of G, T are balls in B. ε-Samples are a very common and powerful coreset for approximating P; the set S can be used as a proxy for P in many diverse applications (cf. [2,33,15,34]). For binary range spaces with constant VC-dimension [39] a random sample S of size O((1/ε 2 ) log(1/δ)) provides an ε-sample with probability at least 1 − δ [26].…”
Section: Combinatorial Geometry Connection
confidence: 99%
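The sample-size bound quoted above can be illustrated numerically. The sketch below is a minimal illustration, not taken from the cited works: the constant `c = 0.5`, the choice `d = 2` (the VC-dimension of intervals on the line), and the specific test intervals are assumptions made here for demonstration.

```python
import math
import random

def epsilon_sample_size(eps, delta, d=1, c=0.5):
    # Sample size of the classic VC form (c / eps^2) * (d + log(1/delta));
    # the constant c = 0.5 is illustrative, not from the cited works.
    return math.ceil((c / eps**2) * (d + math.log(1.0 / delta)))

random.seed(0)
population = [random.random() for _ in range(100_000)]

eps, delta = 0.05, 0.05
m = epsilon_sample_size(eps, delta, d=2)  # intervals on the line have VC-dim 2
sample = random.sample(population, m)

def frac_in(points, lo, hi):
    # Fraction of points falling in the interval [lo, hi].
    return sum(lo <= x <= hi for x in points) / len(points)

# The sample frequency should approximate the population frequency
# to within eps, simultaneously for every interval range.
max_err = max(abs(frac_in(population, a, b) - frac_in(sample, a, b))
              for a, b in [(0.1, 0.3), (0.2, 0.9), (0.0, 0.5)])
print(m, round(max_err, 3))
```

Note that the guarantee is uniform over the whole range space, which is what distinguishes an ε-sample from a per-query Chernoff-style bound.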
“…Data-driven Cardinality Estimation Data-driven cardinality estimation methods construct estimation models based on the underlying data. First, sampling-based methods [27,47,60] estimate cardinalities by scanning a sample of the data, which incurs space overhead and can be expensive. Histogram-based methods [17,34,35,49,52,58,69,70,73] approximate the data distribution with histograms.…”
Section: Related Work
confidence: 99%
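As a concrete sketch of the histogram approach mentioned in the statement above, the fragment below builds an equi-width histogram and estimates a range count under the standard uniformity-within-bucket assumption. The helper names, bucket count, and data distribution are hypothetical, chosen only for illustration.

```python
import random

def build_equiwidth_hist(values, n_buckets=10):
    # Hypothetical helper: equi-width histogram over [min(values), max(values)].
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0
    counts = [0] * n_buckets
    for v in values:
        i = min(int((v - lo) / width), n_buckets - 1)
        counts[i] += 1
    return lo, width, counts

def estimate_range_count(hist, a, b):
    # Estimate |{v : a <= v <= b}|, assuming values are spread
    # uniformly inside each bucket (the usual histogram assumption).
    lo, width, counts = hist
    total = 0.0
    for i, c in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        total += c * overlap / width
    return total

random.seed(1)
data = [random.gauss(50, 15) for _ in range(10_000)]
hist = build_equiwidth_hist(data, n_buckets=20)
true_count = sum(30 <= v <= 60 for v in data)
est = estimate_range_count(hist, 30, 60)
print(true_count, round(est))
```

The histogram's accuracy degrades when the uniformity assumption fails inside a bucket, which is one motivation for the sampling-based alternatives the excerpt contrasts it with.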
“…Hence, we may consider reducing the execution cost if it entails a bounded impact on the quality of the end result [9]. For example, Riondato et al [121] develop a method for random sampling of a database for estimating the selectivity of a query. Given a class of queries, the execution of any query in that class on the sample provides an accurate estimate for the selectivity of the query on the original large database.…”
Section: Data Management and Machine Learning
confidence: 99%
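The sampling idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the table schema, predicate, and sample size are invented here, and the comment about simultaneous guarantees only paraphrases the surrounding text (the paper derives the required sample size from the VC-dimension of the query class).

```python
import random

def estimate_selectivity(table, predicate, sample_size, seed=0):
    # Estimate the fraction of rows satisfying `predicate` from a uniform
    # random sample. Per the cited result, a sample whose size depends on
    # the VC-dimension of the query class yields estimates within epsilon
    # for ALL queries in the class simultaneously, with high probability.
    rng = random.Random(seed)
    sample = rng.sample(table, min(sample_size, len(table)))
    return sum(predicate(row) for row in sample) / len(sample)

# Hypothetical table of (age, salary) rows
rng = random.Random(42)
table = [(rng.randint(18, 80), rng.randint(20_000, 200_000))
         for _ in range(50_000)]

pred = lambda row: row[0] < 40 and row[1] > 100_000
true_sel = sum(pred(r) for r in table) / len(table)
est_sel = estimate_selectivity(table, pred, sample_size=2_000)
print(round(true_sel, 3), round(est_sel, 3))
```

The practical appeal is that the same fixed sample can serve every query in the class, rather than requiring a fresh scan per query.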
“…It is then an important direction to establish probabilistic models that capture the combined process and allow to estimate probabilities of end results. For example, by applying the notion of the Vapnik-Chervonenkis dimension, an important theoretical concept in generalization theory, to database queries, Riondato et al [121] provide accurate bounds for their selectivity estimates that hold with high probability; moreover, they show the error probability to hold simultaneously for the selectivity estimates of all queries in the query class. In general, this direction can leverage the past decade of research on probabilistic databases [47,90,23,91], which can be combined with theoretical frameworks of machine learning, such as PAC (Probably Approximately Correct) learning [139].…”
Section: Data Management and Machine Learning
confidence: 99%