Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems - PODS '00 2000
DOI: 10.1145/335168.335230

Towards estimation error guarantees for distinct values

Abstract: We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of …
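The intuition behind the negative result can be sketched with two hypothetical columns that a small random sample usually cannot tell apart. This is an illustrative construction under stated assumptions, not the argument from the paper, whose abstract is truncated above. The function names, the table size n, the number of singleton values k, and the sample size r are all chosen here for illustration.

```python
import random

def make_columns(n, k):
    """Build two hypothetical columns of n rows each (illustrative only)."""
    # Column A: one value repeated n times, so 1 distinct value.
    col_a = [0] * n
    # Column B: n - k copies of that value plus k singletons, so k + 1 distinct values.
    col_b = [0] * (n - k) + list(range(1, k + 1))
    return col_a, col_b

def sample_distinct(column, r, rng):
    """Count distinct values in a uniform sample of r rows drawn without replacement."""
    return len(set(rng.sample(column, r)))

def prob_sample_misses_singletons(n, k, r):
    """Exact probability that r rows drawn without replacement avoid all k singletons."""
    p = 1.0
    for i in range(r):
        p *= (n - k - i) / (n - i)
    return p

if __name__ == "__main__":
    rng = random.Random(0)
    n, k, r = 10_000, 100, 100
    col_a, col_b = make_columns(n, k)
    print("distinct in A:", len(set(col_a)), "distinct in B:", len(set(col_b)))
    print("distinct seen in a sample of A:", sample_distinct(col_a, r, rng))
    print("P(sample of B looks like A):", prob_sample_misses_singletons(n, k, r))
```

With these numbers, a sample of 1% of the rows misses every singleton in column B more than a third of the time. In that case the sample is identical to a sample from column A, even though the true distinct counts differ by a factor of about 100.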

Cited by 178 publications (227 citation statements)
References 21 publications
“…Distinct Elements: For the number of distinct elements, F_0, we show that the current best offline methods for estimating F_0 from a random sample can be implemented in a streaming fashion using very small space. While it is known that random sampling can significantly reduce the accuracy of an estimate for F_0 [7], we show that the need to process this stream using small space does not. The upper and lower bounds are presented in Section 4.…”
Section: Frequency Moments (mentioning)
confidence: 78%
“…The following theorem is from Charikar et al. [7], which we have restated slightly to fit our notation (the original theorem is about database tables). Let F_0 be the number of elements in a data set T of total size n. Note that T may be a stored data set, and need not be processed in a one-pass streaming manner.…”
Section: Distinct Elements (mentioning)
confidence: 99%
“…It is also shown that the k-th statistical moment can be approximated within an additive error of ε by using a random sample of size O((1/ε^2) log(1/δ)), and that this is a lower bound on the size of the sample. Works that also refer to lower bounds on query complexity for approximate solutions include results on the approximation of the mean [28], [36,91], and the approximation of the frequency moment [31].…”
Section: Sampling (mentioning)
confidence: 99%
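A sample size of the form O((1/ε^2) log(1/δ)) can be made concrete with Hoeffding's inequality. This is a minimal sketch that swaps in one standard bound as an example, and it is not the construction used in the cited work. It assumes the values are scaled to lie in [0, 1], so that their k-th powers do too, and that sampling is uniform with replacement. The function names are hypothetical.

```python
import math
import random

def hoeffding_sample_size(eps, delta):
    """Sample size sufficient, by Hoeffding's inequality, for the mean of values
    in [0, 1] to be within eps of the truth with probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def estimate_moment(values, k, eps, delta, rng):
    """Estimate the k-th moment of values in [0, 1] from a uniform random sample
    drawn with replacement (assumption: values are already scaled to [0, 1])."""
    m = hoeffding_sample_size(eps, delta)
    total = 0.0
    for _ in range(m):
        total += rng.choice(values) ** k
    return total / m

if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.random() for _ in range(100_000)]
    exact = sum(x ** 2 for x in data) / len(data)
    approx = estimate_moment(data, 2, eps=0.05, delta=0.05, rng=rng)
    print("sample size:", hoeffding_sample_size(0.05, 0.05))
    print("exact second moment:", exact, "estimate:", approx)
```

The sample size depends only on ε and δ, not on the size of the data set. This contrasts with the distinct-values problem, where the abstract above states that small error cannot be guaranteed without examining a large fraction of the input.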
“…It is of importance to query optimization and otherwise to know the number of distinct values that each attribute of the table assumes. The importance of this problem is highlighted in [9]: "A principled choice of an execution plan by an optimizer heavily depends on the availability of 1. Notice that the generated "difference stream," a - b, will usually contain negative values corresponding to points where b_i > a_i.…”
Section: Maintaining Distinct Values In Traditional Databases (mentioning)
confidence: 99%