Quantiles over data streams

Wang, Lu; Luo, Ge; Yi, Ke; Cormode, Graham

doi:10.1145/2463676.2465312

Cited by 50 publications

(39 citation statements)

References 31 publications


“…We compare against a number of alternative quantile summaries: a mergeable equi-width histogram (EW-Hist) using power-of-two ranges [65], the 'GKArray' (GK) variant of the Greenwald Khanna [34,52] sketch, the AVL-tree T-Digest (T-Digest) [28] sketch, the streaming histogram (S-Hist) in [12] as implemented in Druid, the 'Random' (RandomW) sketch from [52,77], reservoir sampling (Sampling) [76], and the low discrepancy mergeable sketch (Merge12) from [3], both implemented in the Yahoo! datasketches library [1].…”

Section: Methods (mentioning)

confidence: 99%

“…We quantify the accuracy of a quantile estimate using the quantile error ε as defined in Section 3.1. Then, as in [52,77] we can compare the accuracies of summaries on a given dataset by computing their average error ε_avg over a set of uniformly spaced ϕ-quantiles. In the evaluation that follows, we test on 21 equally spaced ϕ between 0.01 and 0.99.…”

Section: Methods (mentioning)

confidence: 99%
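The averaging described in the statement above can be sketched in a few lines of Python. The excerpt does not reproduce the definition from its Section 3.1, so this sketch assumes the quantile error is the normalized rank error |rank(q) − ⌊ϕn⌋| / n; the function name `average_quantile_error` and the `estimate_quantile` callback are hypothetical names introduced only for illustration.

```python
import bisect
import math


def average_quantile_error(data, estimate_quantile, num_phis=21, lo=0.01, hi=0.99):
    """Mean normalized rank error over evenly spaced phi values.

    Assumption: error(phi) = |rank(q) - floor(phi * n)| / n, where q is the
    estimate returned by estimate_quantile(phi) and rank(q) counts items <= q.
    """
    sorted_data = sorted(data)
    n = len(sorted_data)
    # 21 equally spaced phi values between 0.01 and 0.99 by default.
    phis = [lo + i * (hi - lo) / (num_phis - 1) for i in range(num_phis)]
    total = 0.0
    for phi in phis:
        q = estimate_quantile(phi)
        rank = bisect.bisect_right(sorted_data, q)  # number of items <= q
        total += abs(rank - math.floor(phi * n)) / n
    return total / num_phis


# Usage: score a (deliberately poor) estimator that always returns the median.
values = list(range(1, 101))
print(average_quantile_error(values, lambda phi: 50))
```

Any summary that exposes a quantile query can be passed in as `estimate_quantile`, which makes it possible to compare several summaries on the same dataset with one number each.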


Moment-based quantile sketches for efficient high cardinality aggregation queries

Gan, Ding, Tai et al. 2018

Proc. VLDB Endow.


Interactive analytics increasingly involves querying for quantiles over sub-populations of high cardinality datasets. Data processing engines such as Druid and Spark use mergeable summaries to estimate quantiles, but summary merge times can be a bottleneck during aggregation. We show how a compact and efficiently mergeable quantile sketch can support aggregation workloads. This data structure, which we refer to as the moments sketch, operates with a small memory footprint (200 bytes) and computationally efficient (50ns) merges by tracking only a set of summary statistics, notably the sample moments. We demonstrate how we can efficiently estimate quantiles using the method of moments and the maximum entropy principle, and show how the use of a cascade further improves query time for threshold predicates. Empirical evaluation shows that the moments sketch can achieve less than 1 percent quantile error with 15× less overhead than comparable summaries, improving end query time in the MacroBase engine by up to 7× and the Druid engine by up to 60×.
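To illustrate why a summary built from sample moments is cheap to merge, here is a minimal Python sketch. It is not the authors' implementation: the class name `MomentsSummary`, the choice of `k`, and the decision to also track count, minimum and maximum are assumptions made for this example. It covers only accumulation and merging; the quantile estimation step (method of moments with the maximum entropy principle) is not shown.

```python
class MomentsSummary:
    """Hypothetical fixed-size summary holding power sums of the data."""

    def __init__(self, k=4):
        self.k = k                       # number of moments tracked (assumed)
        self.count = 0
        self.minimum = float("inf")
        self.maximum = float("-inf")
        self.power_sums = [0.0] * k      # power_sums[i] = sum of x ** (i + 1)

    def add(self, x):
        self.count += 1
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        power = 1.0
        for i in range(self.k):
            power *= x
            self.power_sums[i] += power

    def merge(self, other):
        """Fold another summary into this one with k + 3 scalar updates."""
        if self.k != other.k:
            raise ValueError("summaries must track the same number of moments")
        self.count += other.count
        self.minimum = min(self.minimum, other.minimum)
        self.maximum = max(self.maximum, other.maximum)
        for i in range(self.k):
            self.power_sums[i] += other.power_sums[i]

    def moments(self):
        """Sample moments: the mean of x ** i for i = 1..k."""
        return [s / self.count for s in self.power_sums]


# Usage: summaries of two sub-populations combine without revisiting the data.
left, right = MomentsSummary(), MomentsSummary()
for x in (1.0, 2.0, 3.0):
    left.add(x)
for x in (4.0, 5.0, 6.0):
    right.add(x)
left.merge(right)
print(left.count, left.moments()[0])
```

Because a merge is a handful of additions and comparisons over fixed-size state, its cost does not depend on how many items each summary has absorbed.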


“…A simple deterministic version of their algorithm achieves the same bounds. This was pointed out, for example, by [1]. We refer to their algorithm as MRL.…”

Section: Related Work (mentioning)

confidence: 97%

Optimal Quantile Approximation in Streams

Karnin, Lang, Liberty 2016

2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS)


This paper resolves one of the longest standing basic problems in the streaming computational model. Namely, optimal construction of quantile sketches. An ε approximate quantile sketch receives a stream of items x_1, ..., x_n and allows one to approximate the rank of any query up to additive error εn with probability at least 1 − δ. The rank of a query x is the number of stream items such that x_i ≤ x. The minimal sketch size required for this task is trivially at least 1/ε. Felber and Ostrovsky obtain a O((1/ε) log(1/ε)) space sketch for a fixed δ. To date, no better upper or lower bounds were known even for randomly permuted streams or for approximating a specific quantile, e.g., the median. This paper obtains an O((1/ε) log log(1/δ)) space sketch and a matching lower bound. This resolves the open problem and proves a qualitative gap between randomized and deterministic quantile sketching. One of our contributions is a novel representation and modification of the widely used merge-and-reduce construction. This subtle modification allows for an analysis which is both tight and extremely simple. Similar techniques should be useful for improving other sketching objectives and geometric coreset constructions.
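The rank definition and the additive error guarantee in the abstract above can be made concrete with a small Python example. This is an offline illustration only, not the streaming sketch from the paper: it assumes the whole input is available and sorted, and keeps every ⌊εn⌋-th item, which yields roughly 1/ε stored items and rank estimates within εn of the truth. The function names are hypothetical.

```python
import bisect


def exact_rank(sorted_items, x):
    """Number of items <= x, matching the rank definition in the abstract."""
    return bisect.bisect_right(sorted_items, x)


def build_offline_summary(sorted_items, eps):
    """Keep the items of rank step, 2*step, ... where step = floor(eps * n).

    Offline stand-in for a sketch (assumption): requires the sorted input.
    """
    n = len(sorted_items)
    step = max(1, int(eps * n))
    return sorted_items[step - 1::step], step


def estimate_rank(summary, step, x):
    """Each kept item stands for `step` input items, so scale the count."""
    return step * bisect.bisect_right(summary, x)


# Usage: compare estimated and exact ranks for one query.
items = sorted((i * 7) % 23 for i in range(50))
summary, step = build_offline_summary(items, eps=0.1)
print(len(summary), estimate_rank(summary, step, 11), exact_rank(items, 11))
```

The estimate undercounts by the true rank modulo `step`, which is less than `step` and therefore at most εn. The difficulty the paper addresses is achieving a comparable guarantee in one pass without storing or sorting the stream.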


“…Shrivastava et al [24] present a streaming algorithm for ε-approximate quantiles called the "QDigest" that has a space complexity of O((1/ε) log U), where U is the size of the input domain. Wang et al [26] performed an experimental evaluation of different streaming algorithms [15,24,19]. They concluded that MRL99 [19] and Greenwald-Khanna [15] are two very competitive algorithms with MRL99 performing slightly better than Greenwald-Khanna in terms of space requirement and time for a given accuracy.…”

Section: Related Work (mentioning)

confidence: 99%

Estimating quantiles from the union of historical and streaming data

2016


Modern enterprises generate huge amounts of streaming data, for example, micro-blog feeds, financial data, network monitoring and industrial application monitoring. While Data Stream Management Systems have proven successful in providing support for real-time alerting, many applications, such as network monitoring for intrusion detection and real-time bidding, require complex analytics over historical and real-time data over the data streams. We present a new method to process one of the most fundamental analytical primitives, quantile queries, on the union of historical and streaming data. Our method combines an index on historical data with a memory-efficient sketch on streaming data to answer quantile queries with accuracy-resource tradeoffs that are significantly better than current solutions that are based solely on disk-resident indexes or solely on streaming algorithms.

Disciplines: Electrical and Computer Engineering

Comments: This is a manuscript of a proceeding published as Singh, Sneha Aman, Divesh Srivastava, and Srikanta Tirthapura. "Estimating quantiles from the union of historical and streaming data."
