Quantiles over data streams: experimental comparisons, new analyses, and further improvements

Luo, Ge; Wang, Lu; Yi, Ke; Cormode, Graham

doi:10.1007/s00778-016-0424-7

Cited by 41 publications

(54 citation statements)

References 29 publications

“…We compare against a number of alternative quantile summaries: a mergeable equi-width histogram (EW-Hist) using power-of-two ranges [65], the 'GKArray' (GK) variant of the Greenwald Khanna [34,52] sketch, the AVL-tree T-Digest (T-Digest) [28] sketch, the streaming histogram (S-Hist) in [12] as implemented in Druid, the 'Random' (RandomW) sketch from [52,77], reservoir sampling (Sampling) [76], and the low discrepancy mergeable sketch (Merge12) from [3], both implemented in the Yahoo! datasketches library [1].…”

Section: Methods (mentioning)

confidence: 99%

Moment-based quantile sketches for efficient high cardinality aggregation queries

Gan, Ding, Tai et al. 2018

Proc. VLDB Endow.

Interactive analytics increasingly involves querying for quantiles over sub-populations of high cardinality datasets. Data processing engines such as Druid and Spark use mergeable summaries to estimate quantiles, but summary merge times can be a bottleneck during aggregation. We show how a compact and efficiently mergeable quantile sketch can support aggregation workloads. This data structure, which we refer to as the moments sketch, operates with a small memory footprint (200 bytes) and computationally efficient (50ns) merges by tracking only a set of summary statistics, notably the sample moments. We demonstrate how we can efficiently estimate quantiles using the method of moments and the maximum entropy principle, and show how the use of a cascade further improves query time for threshold predicates. Empirical evaluation shows that the moments sketch can achieve less than 1 percent quantile error with 15× less overhead than comparable summaries, improving end query time in the MacroBase engine by up to 7× and the Druid engine by up to 60×.
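The abstract above attributes the fast merges to the sketch tracking only a few summary statistics, notably the sample moments. A minimal sketch of that idea follows, assuming a hypothetical `MomentsSummary` class that keeps a count, the minimum, the maximum, and the first `k` power sums. The class name, the fields, and the default `k = 4` are illustrative assumptions and are not taken from the cited paper. The quantile estimation step (method of moments with maximum entropy) is deliberately omitted.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class MomentsSummary:
    """Hypothetical fixed-size summary: count, min, max, and k power sums."""

    k: int = 4  # number of power sums tracked (illustrative choice)
    count: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf
    power_sums: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start with zeroed power sums unless the caller supplied them.
        if not self.power_sums:
            self.power_sums = [0.0] * self.k

    def add(self, x: float) -> None:
        """Fold one stream item into the summary."""
        self.count += 1
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        power = x
        for i in range(self.k):
            self.power_sums[i] += power  # accumulates x ** (i + 1)
            power *= x

    def merge(self, other: "MomentsSummary") -> "MomentsSummary":
        """Merging is element-wise: O(k) work, independent of stream length."""
        if self.k != other.k:
            raise ValueError("summaries must track the same number of moments")
        return MomentsSummary(
            k=self.k,
            count=self.count + other.count,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
            power_sums=[a + b for a, b in zip(self.power_sums, other.power_sums)],
        )

    def sample_moments(self) -> List[float]:
        """Return the sample moments: mean of x, mean of x**2, and so on."""
        if self.count == 0:
            raise ValueError("empty summary")
        return [s / self.count for s in self.power_sums]


# Usage: summarize two partitions separately, then merge them.
left, right = MomentsSummary(), MomentsSummary()
for x in [1.0, 2.0, 3.0]:
    left.add(x)
for x in [4.0, 5.0]:
    right.add(x)
merged = left.merge(right)
```

Because every field combines by addition, `min`, or `max`, the merged summary matches the one that would be built from the concatenated data, which is what makes this kind of summary mergeable.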

“…Note that the lower and upper bounds on the rank of any stored number differ by at most 2δN and upper (or lower) bounds on the rank of two consecutive stored numbers differ by at most 2δN as well. The space requirement of Q(δ) is O((1/δ) · log(δN)), however, in practice the space used is observed to scale linearly with 1/δ [36]. (Note that an offline optimal data structure for δ-approximate quantiles uses space O(1/δ).)…”
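The rank-bound property in the statement above can be illustrated with a small offline example. This is only a sketch under stated assumptions: it is not the online Greenwald-Khanna algorithm, and the helper names `build_offline_summary` and `query_rank` are hypothetical. It sorts the data, keeps roughly one item per 2δN ranks with exact rank bounds, and answers a rank query with the stored item whose rank interval is closest to the target, which keeps the rank error within δN.

```python
import math
from typing import List, Sequence, Tuple

# (value, lower bound on rank, upper bound on rank); ranks are 1-based.
Entry = Tuple[float, int, int]


def build_offline_summary(data: Sequence[float], delta: float) -> List[Entry]:
    """Keep about one item per 2*delta*N ranks from the sorted data.

    Offline illustration only: ranks are exact here, so the lower and
    upper bounds coincide, and consecutive stored ranks differ by at
    most 2*delta*N.
    """
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        return []
    step = max(1, math.floor(2 * delta * n))
    ranks = list(range(1, n + 1, step))
    if ranks[-1] != n:
        ranks.append(n)  # always keep the maximum
    return [(ordered[r - 1], r, r) for r in ranks]


def query_rank(summary: List[Entry], target_rank: int) -> float:
    """Return the stored value whose rank interval is closest to target_rank."""
    if not summary:
        raise ValueError("empty summary")
    best = min(summary, key=lambda e: max(target_rank - e[1], e[2] - target_rank))
    return best[0]


# Usage: a permutation of 1..100, so each value equals its own rank.
data = [(i * 37) % 101 for i in range(1, 101)]
summary = build_offline_summary(data, delta=0.1)
median_estimate = query_rank(summary, target_rank=50)
```

The stored list has about 1/(2δ) entries, in line with the O(1/δ) offline bound mentioned in the statement.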

Section: Processing the Stream And Rounding (mentioning)

confidence: 99%

Streaming Algorithms for Bin Packing and Vector Scheduling

Cormode, Veselý 2020

Lecture Notes in Computer Science

Self Cite

Problems involving the efficient arrangement of simple objects, as captured by bin packing and makespan scheduling, are fundamental tasks in combinatorial optimization. These are well understood in the traditional online and offline cases, but have been less well-studied when the volume of the input is truly massive, and cannot even be read into memory. This is captured by the streaming model of computation, where the aim is to approximate the cost of the solution in one pass over the data, using small space. As a result, streaming algorithms produce concise input summaries that approximately preserve the optimum value. We design the first efficient streaming algorithms for these fundamental problems in combinatorial optimization. For BIN PACKING, we provide a streaming asymptotic (1 + ε)-approximation with Õ(1/ε) memory, where Õ hides logarithmic factors. Moreover, such a space bound is essentially optimal. Our algorithm implies a streaming (d + ε)-approximation for VECTOR BIN PACKING in d dimensions, running in space Õ(d/ε). For the related VECTOR SCHEDULING problem, we show how to construct an input summary in space O(d² · m/ε²) that preserves the optimum value up to a factor of 2 − 1/m + ε, where m is the number of identical machines.

“…Aside from oblivious sampling algorithms (which require storing Ω(1/ε²) samples) the only other such work of which we are aware is an approach by Wang, Luo, Yi, and Cormode [12] that combines the methods of [1] and [8] into a hybrid with the same space bound as [1].…”

Section: Previous and Related Work (mentioning)

confidence: 99%

“…[1–18] For the comparison model, the best deterministic online summary to date is the (GK) summary of Greenwald and Khanna [4], which uses O((1/ε) log(εn)) space. This improved upon a deterministic (MRL) summary of Manku, Rajagopalan, and Lindsay [7] and a summary implied by Munro and Paterson [9], which use O((1/ε) log²(εn)) space.…”

Section: Previous and Related Work (mentioning)

confidence: 99%

“…This simple summary is not new (it is mentioned in Wang et al. [12], for example) but the discussion provides exposition for Section 3, in which we develop this summary into a fully online summary with the same asymptotic space complexity that can answer queries at any point in time. At that point we will have proven the following theorem, which constitutes our main result.…”

Section: Our Contributions (mentioning)

confidence: 99%
