Towards estimation error guarantees for distinct values

Charikar, Moses; Chaudhuri, Surajit; Motwani, Rajeev; Narasayya, Vivek

doi:10.1145/335168.335230

Cited by 178 publications

(227 citation statements)

References 21 publications

Supporting

Mentioning

225

Contrasting

“…Distinct Elements: For the number of distinct elements, $F_0$, we show that the current best offline methods for estimating $F_0$ from a random sample can be implemented in a streaming fashion using very small space. While it is known that random sampling can significantly reduce the accuracy of an estimate for $F_0$ [7], we show that the need to process this stream using small space does not. The upper and lower bounds are presented in Section 4.…”

Section: Frequency Moments (mentioning)

confidence: 78%

“…The following theorem is from Charikar et al. [7], which we have restated slightly to fit our notation (the original theorem is about database tables). Let $F_0$ be the number of elements in a data set $T$ of total size $n$. Note that $T$ may be a stored data set, and need not be processed in a one-pass streaming manner.…”

Section: Distinct Elements (mentioning)

confidence: 99%

“…Theorem 3 (Charikar et al. [7]) Consider any (randomized) estimator $\hat{F}_0$ for the number of distinct values $F_0$ of $T$ that examines at most $r$ out of the $n$ elements in $T$. For any $\gamma > e^{-r}$, there exists a choice of the input $T$ such that with probability at least $\gamma$, the multiplicative error is at least $\sqrt{\frac{n - r}{2r} \ln \gamma^{-1}}$.…”

Section: Distinct Elements (mentioning)

confidence: 99%
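
To get a feel for the lower bound in the theorem quoted above, the following minimal Python sketch evaluates it for concrete values. It assumes the square-root form of the bound from Charikar et al., sqrt(((n - r) / (2r)) * ln(1/γ)). The function name `distinct_value_error_lower_bound` and the sample figures are illustrative assumptions, not taken from the cited papers.

```python
import math


def distinct_value_error_lower_bound(n: int, r: int, gamma: float) -> float:
    """Evaluate sqrt(((n - r) / (2r)) * ln(1/gamma)).

    n: total number of elements in the data set
    r: number of elements the estimator is allowed to examine
    gamma: probability with which the error is at least the returned value
    The theorem requires e^{-r} < gamma; gamma < 1 is assumed here so that
    the logarithm is positive.
    """
    if not 0 < r <= n:
        raise ValueError("need 0 < r <= n")
    if not math.exp(-r) < gamma < 1:
        raise ValueError("need e^{-r} < gamma < 1")
    return math.sqrt((n - r) / (2 * r) * math.log(1 / gamma))


if __name__ == "__main__":
    # Hypothetical scenario: a 1% sample of one million elements.
    bound = distinct_value_error_lower_bound(n=1_000_000, r=10_000, gamma=0.5)
    print(f"multiplicative error >= {bound:.2f} with probability >= 0.5")
```

Under these assumed figures the bound comes out at a factor of about 5.86, and it shrinks to 0 only as r approaches n, which is the sense in which sampling alone limits accuracy.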

Space-Efficient Estimation of Statistics Over Sub-Sampled Streams

et al. 2015

In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to subsample the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, in many cases, the estimation of aggregates on the original stream cannot be accomplished through simply estimating them on the sampled stream, followed by a normalization. We present algorithms for estimating frequency moments, support size, entropy, and heavy hitters of the original stream, through a single pass over the sampled stream.

“…It is also shown that the k-th statistical moment can be approximated within an additive error of $\epsilon$ by using a random sample of size $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$, and that this is a lower bound on the size of the sample. Works that also refer to lower bounds on query complexity for approximate solutions include results on the approximation of the mean [28], [36,91] and the approximation of the frequency moments [31].…”

Section: Sampling (mentioning)

confidence: 99%
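
The sample-size bound quoted above, of order (1/ε²) log(1/δ), hides its constant. As a hedged illustration only, the sketch below uses the standard Hoeffding inequality for the mean of values bounded in [0, 1], which gives one concrete instance of that order: m ≥ ln(2/δ) / (2ε²). The choice of Hoeffding, the [0, 1] range, and the function name `hoeffding_sample_size` are assumptions made here, not details from the cited work.

```python
import math


def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Smallest m with 2 * exp(-2 * m * eps^2) <= delta.

    For i.i.d. values in [0, 1], a sample of this size has a mean within
    additive error eps of the true mean with probability at least 1 - delta.
    """
    if not 0 < eps < 1:
        raise ValueError("need 0 < eps < 1")
    if not 0 < delta < 1:
        raise ValueError("need 0 < delta < 1")
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


if __name__ == "__main__":
    for eps in (0.1, 0.05):
        print(eps, hoeffding_sample_size(eps, delta=0.05))
```

Halving ε roughly quadruples the required sample, while tightening δ costs only a logarithmic factor, which matches the shape of the quoted bound.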

On Approximation Algorithms for Data Mining Applications

Afrati

2006

Lecture Notes in Computer Science

“…It is of importance to query optimization and otherwise to know the number of distinct values that each attribute of the table assumes. The importance of this problem is highlighted in [9]: "A principled choice of an execution plan by an optimizer heavily depends on the availability of 1. Notice that the generated "difference stream," $a - b$, will usually contain negative values corresponding to points where $b_i > a_i$.…”

Section: Maintaining Distinct Values In Traditional Databases (mentioning)

confidence: 99%

Comparing Data Streams Using Hamming Norms (How to Zero In)

Cormode

2002

VLDB '02: Proceedings of the 28th International Conference on Very Large Databases

Abstract: Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed "on the fly" as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the "$l_0$ sketch" and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
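
The two uses of the Hamming norm described in this abstract can be made concrete with an exact computation. The sketch below is a baseline that stores every item count, so it uses memory linear in the number of distinct items; it is not the small-space sketch technique the abstract refers to. It assumes insert-only streams of hashable items, and the function names are illustrative.

```python
from collections import Counter
from typing import Hashable, Iterable


def hamming_norm(stream: Iterable[Hashable]) -> int:
    """Number of items with a nonzero count, i.e. the number of distinct items."""
    counts = Counter(stream)
    return sum(1 for c in counts.values() if c != 0)


def hamming_distance(stream_a: Iterable[Hashable],
                     stream_b: Iterable[Hashable]) -> int:
    """Number of items whose counts differ between the two streams."""
    a, b = Counter(stream_a), Counter(stream_b)
    # Counter returns 0 for missing keys, so items seen in only one stream
    # are counted as differing.
    return sum(1 for item in a.keys() | b.keys() if a[item] != b[item])


if __name__ == "__main__":
    print(hamming_norm(["a", "b", "a", "c"]))
    print(hamming_distance(["a", "a", "b"], ["a", "b", "c"]))
```

An exact baseline like this is mainly useful as a reference against which an approximate, small-space estimate can be checked on test data.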
