2002
DOI: 10.1007/3-540-45465-9_59

Finding Frequent Items in Data Streams

Abstract: We present a 1-pass algorithm for estimating the most frequent items in a data stream using very limited storage space. Our method relies on a novel data structure called a count sketch, which allows us to estimate the frequencies of all the items in the stream. Our algorithm achieves better space bounds than the previous best known algorithms for this problem for many natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of est…
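The count-sketch data structure named in the abstract can be illustrated in a few dozen lines. The sketch below is a minimal illustration, not the paper's exact construction: Python's built-in hash with per-row seeds stands in for the pairwise-independent hash families the analysis assumes, and the width/depth parameters are placeholder defaults rather than the bounds derived in the paper.

```python
import random

class CountSketch:
    """Minimal count-sketch: `depth` rows of `width` signed counters.
    Each item maps to one counter per row and is added with a +/-1 sign;
    the median of the signed per-row readings estimates its frequency."""

    def __init__(self, width=1024, depth=5, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        # Per-row seed pairs stand in for pairwise-independent hash families.
        self._seeds = [(rng.getrandbits(64), rng.getrandbits(64))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, item, row):
        bucket_seed, sign_seed = self._seeds[row]
        bucket = hash((bucket_seed, item)) % self.width
        sign = 1 if hash((sign_seed, item)) & 1 else -1
        return bucket, sign

    def add(self, item, count=1):
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, row)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        readings = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, row)
            readings.append(sign * self.table[row][bucket])
        readings.sort()
        return readings[len(readings) // 2]  # median across rows
```

Updates and queries each touch one counter per row; taking the median across rows lets the random signs cancel noise from colliding items. To recover the frequent items themselves, the paper pairs this estimator with a heap of the current top candidates.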



Cited by 511 publications (223 citation statements)
References 6 publications
“…• Distinct Counting: Every element in the input stream is hashed (uniformly) between (0, 1), and the t lowest hash values are stored. At the end, the algorithm reports t/min_t as the number of distinct elements in the stream, where min_t is the value of the t-th smallest hash value.[8]…”
[8] Alternately, we could use a constraint ω1 > ω2, and set ω to be some value between ω1 and ω2.
Section: Lossy Counting and Distinct Counting Algorithms
Citation type: mentioning
Confidence: 99%
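The t-minimum-values estimator quoted above is easy to sketch. The version below is a minimal illustration under assumptions not in the quote: SHA-1 over repr(item) stands in for a uniform hash into (0, 1), and t = 64 is an arbitrary default; it reports the quoted estimate t/min_t once t distinct hash values have been seen.

```python
import hashlib
import heapq

def distinct_count_kmv(stream, t=64):
    """Estimate the number of distinct elements: hash each item uniformly
    into (0, 1), keep the t smallest hash values, and report t / min_t,
    where min_t is the t-th smallest hash value seen."""
    heap = []        # max-heap via negation: holds the t smallest hashes
    members = set()  # hash values currently stored, to skip duplicates
    for item in stream:
        digest = hashlib.sha1(repr(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big") / 2.0**64  # uniform in (0, 1)
        if h in members:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)  # drop largest stored hash
            members.discard(evicted)
            members.add(h)
    if len(heap) < t:
        return len(heap)  # fewer than t distinct items: count is exact
    return t / -heap[0]   # -heap[0] is min_t, the t-th smallest hash
```

On a stream with n distinct items the estimate concentrates around n for moderately large t; the relative error shrinks roughly as 1/√t.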
“…Note that a superspreader is different from the usual definition of a heavy-hitter ([19, 8, 16, 25, 13, 24]). A heavy-hitter might be a source that sends a lot of packets, and thus exceeds a certain threshold of the total traffic.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
“…Such an approach based on maintaining distinct counters would not only be more complex than our approach, but also likely have a greater space complexity, since maintaining distinct counters with a relative error of ε requires Ω(1/ε²) space [17]. The sketch approach, such as count-sketch [33] or count-min sketch [12], also maintains multiple counters, each of which is the sum of many random variables. Replacing each such counter with a distinct counter leads to its own set of difficulties, one of which is the space complexity of distinct counting, explained above, and the other being the fact that each distinct counter is only approximate (exact distinct counting necessarily requires large space [34]), while the analyses in refs 33 and 12 rely on the different counters in the data structure being exact.…”
Section: Related Work
Citation type: mentioning
Confidence: 99%
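For contrast with the count sketch shown earlier, here is a minimal count-min sketch in the same hedged style: seeded built-in hashing stands in for the pairwise-independent families the guarantees assume, and width/depth are illustrative defaults. Each counter is a plain sum of updates, which is the exactness property the quoted passage says the analyses rely on.

```python
import random

class CountMinSketch:
    """Minimal count-min sketch: `depth` rows of `width` counters. An item
    is added to one counter per row; its estimate is the minimum of those
    counters, which can only overestimate the true count."""

    def __init__(self, width=1024, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        self._seeds = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row, s in enumerate(self._seeds):
            self.table[row][hash((s, item)) % self.width] += count

    def estimate(self, item):
        return min(self.table[row][hash((s, item)) % self.width]
                   for row, s in enumerate(self._seeds))
```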
“…An item is called a φ-heavy hitter if it contributes at least a φ fraction of the entire volume of the stream. There is a large body of literature on heavy-hitter identification (including [7–12]). A persistent item need not be a heavy hitter.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
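As a concrete reading of that definition, the offline check below (an illustration, not a streaming algorithm) returns the φ-heavy hitters of a finished stream; note that at most ⌊1/φ⌋ items can qualify.

```python
from collections import Counter

def phi_heavy_hitters(stream, phi):
    """Return every item contributing at least a phi fraction of the
    total stream volume. Exact and offline, for illustration only;
    one-pass algorithms approximate this set in small space."""
    counts = Counter(stream)
    total = sum(counts.values())
    return {x for x, c in counts.items() if c >= phi * total}

# Example: with phi = 0.3, only 'a' (4 of 7 items) qualifies.
print(phi_heavy_hitters("aabacad", 0.3))  # {'a'}
```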
“…A related and more general problem is that of finding the set of most frequent elements in the stream. This problem comes up in the context of many network applications; one example is search engines, where the data streams in question are streams of queries sent to the search engine and we are interested in finding the most frequent queries in some period of time (Google Zeitgeist and Charikar et al. [2002]). The data stream here is so large that any memory-intensive solution, such as sorting the stream or keeping a counter for each distinct element, would be infeasible, and moreover we can afford to make only one pass over the data.…”
Section: Frequent Elements Query
Citation type: mentioning
Confidence: 99%