2002
DOI: 10.1007/3-540-45465-9_59

Finding Frequent Items in Data Streams

Abstract: We present a 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. Our method relies on a novel data structure called a count sketch, which allows us to estimate the frequencies of all the items in the stream. Our algorithm achieves better space bounds than the previous best known algorithms for this problem for many natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of est…
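The count-sketch data structure named in the abstract can be illustrated in a few dozen lines. The sketch below is a minimal illustration, not the paper's exact construction: Python's built-in hash with per-row seeds stands in for the pairwise-independent hash families the analysis assumes, and the width/depth parameters are placeholder defaults rather than the bounds derived in the paper.

```python
import random

class CountSketch:
    """Minimal count-sketch: `depth` rows of `width` signed counters.
    Each item maps to one counter per row and is added with a +/-1 sign;
    the median of the signed per-row readings estimates its frequency."""

    def __init__(self, width=1024, depth=5, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        # Per-row seed pairs stand in for pairwise-independent hash families.
        self._seeds = [(rng.getrandbits(64), rng.getrandbits(64))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, item, row):
        bucket_seed, sign_seed = self._seeds[row]
        bucket = hash((bucket_seed, item)) % self.width
        sign = 1 if hash((sign_seed, item)) & 1 else -1
        return bucket, sign

    def add(self, item, count=1):
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, row)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        readings = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, row)
            readings.append(sign * self.table[row][bucket])
        readings.sort()
        return readings[len(readings) // 2]  # median across rows
```

Updates and queries each touch one counter per row; taking the median across rows lets the random signs cancel noise from colliding items. To recover the frequent items themselves, the paper pairs this estimator with a heap of the current top candidates.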



Cited by 511 publications (223 citation statements)
References 6 publications
“…• Distinct Counting: Every element in the input stream is hashed (uniformly) between (0, 1), and the t lowest hash values are stored. At the end, the algorithm reports t/min_t as the number of distinct elements in the stream, where min_t is the value of the t-th smallest hash value.[8]…”
[8] Alternately, we could use a constraint ω1 > ω2, and set ω to be some value between ω1 and ω2.
Section: Lossy Counting and Distinct Counting Algorithms
Citation type: mentioning
Confidence: 99%
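The t-minimum-values estimator quoted above is easy to sketch. The version below is a minimal illustration under assumptions not in the quote: SHA-1 over repr(item) stands in for a uniform hash into (0, 1), and t = 64 is an arbitrary default; it reports the quoted estimate t/min_t once t distinct hash values have been seen.

```python
import hashlib
import heapq

def distinct_count_kmv(stream, t=64):
    """Estimate the number of distinct elements: hash each item uniformly
    into (0, 1), keep the t smallest hash values, and report t / min_t,
    where min_t is the t-th smallest hash value seen."""
    heap = []        # max-heap via negation: holds the t smallest hashes
    members = set()  # hash values currently stored, to skip duplicates
    for item in stream:
        digest = hashlib.sha1(repr(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big") / 2.0**64  # uniform in (0, 1)
        if h in members:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)  # drop largest stored hash
            members.discard(evicted)
            members.add(h)
    if len(heap) < t:
        return len(heap)  # fewer than t distinct items: count is exact
    return t / -heap[0]   # -heap[0] is min_t, the t-th smallest hash
```

On a stream with n distinct items the estimate concentrates around n for moderately large t; the relative error shrinks roughly as 1/√t.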
“…Note that a superspreader is different from the usual definition of a heavy-hitter ([19, 8, 16, 25, 13, 24]). A heavy-hitter might be a source that sends a lot of packets, and thus exceeds a certain threshold of the total traffic.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
“…Such an approach based on maintaining distinct counters would not only be more complex than our approach, but also likely have a greater space complexity, since maintaining distinct counters with a relative error of ε requires Ω(1/ε²) space [17]. The sketch approach, such as count-sketch [33] or count-min sketch [12], also maintains multiple counters, each of which is the sum of many random variables. Replacing each such counter with a distinct counter leads to its own set of difficulties, one of which is the space complexity of distinct counting, explained above, and the other being the fact that each distinct counter is only approximate (exact distinct counting necessarily requires large space [34]), while the analyses in refs 33 and 12 rely on the different counters in the data structure being exact.…”
Section: Related Work
Citation type: mentioning
Confidence: 99%
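For contrast with the count sketch shown earlier, here is a minimal count-min sketch in the same hedged style: seeded built-in hashing stands in for the pairwise-independent families the guarantees assume, and width/depth are illustrative defaults. Each counter is a plain sum of updates, which is the exactness property the quoted passage says the analyses rely on.

```python
import random

class CountMinSketch:
    """Minimal count-min sketch: `depth` rows of `width` counters. An item
    is added to one counter per row; its estimate is the minimum of those
    counters, which can only overestimate the true count."""

    def __init__(self, width=1024, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        self._seeds = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row, s in enumerate(self._seeds):
            self.table[row][hash((s, item)) % self.width] += count

    def estimate(self, item):
        return min(self.table[row][hash((s, item)) % self.width]
                   for row, s in enumerate(self._seeds))
```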
“…An item is called a φ-heavy hitter if it contributes at least a φ fraction of the entire volume of the stream. There is a large body of literature on heavy-hitter identification (including [7–12]). A persistent item need not be a heavy hitter.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
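As a concrete reading of that definition, the offline check below (an illustration, not a streaming algorithm) returns the φ-heavy hitters of a finished stream; note that at most ⌊1/φ⌋ items can qualify.

```python
from collections import Counter

def phi_heavy_hitters(stream, phi):
    """Return every item contributing at least a phi fraction of the
    total stream volume. Exact and offline, for illustration only;
    one-pass algorithms approximate this set in small space."""
    counts = Counter(stream)
    total = sum(counts.values())
    return {x for x, c in counts.items() if c >= phi * total}

# Example: with phi = 0.3, only 'a' (4 of 7 items) qualifies.
print(phi_heavy_hitters("aabacad", 0.3))  # {'a'}
```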
“…A related and more general problem is that of finding the set of most frequent elements in the stream. This problem comes up in the context of many network applications; one example is search engines, where the data streams in question are streams of queries sent to the search engine and we are interested in finding the most frequent queries in some period of time (Google Zeitgeist and Charikar et al. [2002]). The data stream here is so large that any memory-intensive solution, such as sorting the stream or keeping a counter for each distinct element, would be infeasible, and moreover we can afford to make only one pass over the data.…”
Section: Frequent Elements Query
Citation type: mentioning
Confidence: 99%