2011
DOI: 10.1007/978-3-642-23783-6_42

The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling

Abstract: We develop a novel method, based on the statistical concept of VC-dimension, for evaluating the selectivity (output cardinality) of SQL queries, a crucial step in optimizing the execution of large-scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound on the VC-dimension of a range space defined by all possible outcomes of a collection (class) of queries. We prove that the VC-dimension is a function of the max…

Cited by 21 publications (15 citation statements)
References 66 publications
“…For instance, super-level sets of G, T are balls in B. ε-Samples are a very common and powerful coreset for approximating P; the set S can be used as a proxy for P in many diverse applications (cf. [2,33,15,34]). For binary range spaces with constant VC-dimension [39] a random sample S of size O((1/ε 2 ) log(1/δ)) provides an ε-sample with probability at least 1 − δ [26].…”
Section: Combinatorial Geometry Connection
confidence: 99%
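The sample-size bound quoted above can be illustrated numerically. The sketch below is a minimal illustration, not taken from the cited works: the constant `c = 0.5`, the choice `d = 2` (the VC-dimension of intervals on the line), and the specific test intervals are assumptions made here for demonstration.

```python
import math
import random

def epsilon_sample_size(eps, delta, d=1, c=0.5):
    # Sample size of the classic VC form (c / eps^2) * (d + log(1/delta));
    # the constant c = 0.5 is illustrative, not from the cited works.
    return math.ceil((c / eps**2) * (d + math.log(1.0 / delta)))

random.seed(0)
population = [random.random() for _ in range(100_000)]

eps, delta = 0.05, 0.05
m = epsilon_sample_size(eps, delta, d=2)  # intervals on the line have VC-dim 2
sample = random.sample(population, m)

def frac_in(points, lo, hi):
    # Fraction of points falling in the interval [lo, hi].
    return sum(lo <= x <= hi for x in points) / len(points)

# The sample frequency should approximate the population frequency
# to within eps, simultaneously for every interval range.
max_err = max(abs(frac_in(population, a, b) - frac_in(sample, a, b))
              for a, b in [(0.1, 0.3), (0.2, 0.9), (0.0, 0.5)])
print(m, round(max_err, 3))
```

Note that the guarantee is uniform over the whole range space, which is what distinguishes an ε-sample from a per-query Chernoff-style bound.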
“…Data-driven Cardinality Estimation Data-driven cardinality estimation methods construct estimation models based on the underlying data. First, sampling-based methods [27,47,60] estimate cardinalities by scanning a sample of the data, which incurs space overhead and can be expensive. Histogram-based methods [17,34,35,49,52,58,69,70,73] approximate the data distribution with histograms.…”
Section: Related Work
confidence: 99%
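As a concrete sketch of the histogram approach mentioned in the statement above, the fragment below builds an equi-width histogram and estimates a range count under the standard uniformity-within-bucket assumption. The helper names, bucket count, and data distribution are hypothetical, chosen only for illustration.

```python
import random

def build_equiwidth_hist(values, n_buckets=10):
    # Hypothetical helper: equi-width histogram over [min(values), max(values)].
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0
    counts = [0] * n_buckets
    for v in values:
        i = min(int((v - lo) / width), n_buckets - 1)
        counts[i] += 1
    return lo, width, counts

def estimate_range_count(hist, a, b):
    # Estimate |{v : a <= v <= b}|, assuming values are spread
    # uniformly inside each bucket (the usual histogram assumption).
    lo, width, counts = hist
    total = 0.0
    for i, c in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        total += c * overlap / width
    return total

random.seed(1)
data = [random.gauss(50, 15) for _ in range(10_000)]
hist = build_equiwidth_hist(data, n_buckets=20)
true_count = sum(30 <= v <= 60 for v in data)
est = estimate_range_count(hist, 30, 60)
print(true_count, round(est))
```

The histogram's accuracy degrades when the uniformity assumption fails inside a bucket, which is one motivation for the sampling-based alternatives the excerpt contrasts it with.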
“…Hence, we may consider reducing the execution cost if it entails a bounded impact on the quality of the end result [9]. For example, Riondato et al [121] develop a method for random sampling of a database for estimating the selectivity of a query. Given a class of queries, the execution of any query in that class on the sample provides an accurate estimate for the selectivity of the query on the original large database.…”
Section: Data Management and Machine Learning
confidence: 99%
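The sampling idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the table schema, predicate, and sample size are invented here, and the comment about simultaneous guarantees only paraphrases the surrounding text (the paper derives the required sample size from the VC-dimension of the query class).

```python
import random

def estimate_selectivity(table, predicate, sample_size, seed=0):
    # Estimate the fraction of rows satisfying `predicate` from a uniform
    # random sample. Per the cited result, a sample whose size depends on
    # the VC-dimension of the query class yields estimates within epsilon
    # for ALL queries in the class simultaneously, with high probability.
    rng = random.Random(seed)
    sample = rng.sample(table, min(sample_size, len(table)))
    return sum(predicate(row) for row in sample) / len(sample)

# Hypothetical table of (age, salary) rows
rng = random.Random(42)
table = [(rng.randint(18, 80), rng.randint(20_000, 200_000))
         for _ in range(50_000)]

pred = lambda row: row[0] < 40 and row[1] > 100_000
true_sel = sum(pred(r) for r in table) / len(table)
est_sel = estimate_selectivity(table, pred, sample_size=2_000)
print(round(true_sel, 3), round(est_sel, 3))
```

The practical appeal is that the same fixed sample can serve every query in the class, rather than requiring a fresh scan per query.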
“…It is then an important direction to establish probabilistic models that capture the combined process and allow to estimate probabilities of end results. For example, by applying the notion of the Vapnik-Chervonenkis dimension, an important theoretical concept in generalization theory, to database queries, Riondato et al [121] provide accurate bounds for their selectivity estimates that hold with high probability; moreover, they show the error probability to hold simultaneously for the selectivity estimates of all queries in the query class. In general, this direction can leverage the past decade of research on probabilistic databases [47,90,23,91], which can be combined with theoretical frameworks of machine learning, such as PAC (Probably Approximately Correct) learning [139].…”
Section: Data Management and Machine Learning
confidence: 99%