Algorithms for distributed functional monitoring

Cormode, Graham; Muthukrishnan, S.; Yi, Ke

doi:10.1145/1921659.1921667

Cited by 124 publications

(161 citation statements)

References 28 publications

Supporting

Mentioning

158

Contrasting


“…Consider the case where each P_i's dataset arrives in a continuous stream; this is what is known as a distributed data stream (Cormode et al, 2008). Then applying results of (Cormode et al, 2010), we can continually maintain a sufficient random sample at the coordinator of size s_ε communicating O((k + s_{ε,ν}) d log |D|) words.…”

Section: Improved Random Sampling For K-players (mentioning)

confidence: 99%

Efficient Protocols for Distributed Classification and Optimization

Daumé, Phillips, Saha, et al. 2012

Lecture Notes in Computer Science


In distributed learning, the goal is to perform a learning task over data distributed across multiple nodes with minimal (expensive) communication. Prior work (Daumé III et al., 2012) proposes a general model that bounds the communication required for learning classifiers while allowing for ε training error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses O(d^2 log 1/ε) words of communication to classify distributed data in arbitrary dimension d, ε-optimally. This readily extends to classification over k nodes with O(kd^2 log 1/ε) words of communication. Our proposed protocol is simple to implement and is considerably more efficient than baselines compared, as demonstrated by our empirical results. In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed data. We show how to solve fixed-dimensional and high dimensional linear programming efficiently in a distributed setting where constraints may be distributed across nodes. Since many learning problems can be viewed as convex optimization problems where constraints are generated by individual points, this models many typical distributed learning scenarios. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting. As a consequence, our methods extend to the wide range of problems solvable using these techniques.


“…However, our goal is to come up with practical algorithms for both detecting and estimating the size of icebergs in real data sets. Recently, Cormode et al [16] proposed the problem of functional monitoring, where the local nodes continuously send updates only insofar as needed to satisfy some global constraint (e.g., detecting all the icebergs). Our work differs from theirs since we assume fixed measurement periods, which potentially allows us to have more communication-efficient mechanisms.…”

Section: Related Work (mentioning)

confidence: 99%

Uncovering Global Icebergs in Distributed Streams: Results and Implications

et al. 2010


Discovering icebergs in distributed streams of data is an important problem for a number of applications in networking and databases. While previous work has concentrated on measuring these icebergs in the non-distributed streaming case or in the non-streaming distributed case, we present a general framework that allows for distributed processing across multiple streams of data. We compare several of the state-of-the-art streaming algorithms for estimating local elephants in the individual streams. However, since an iceberg may be hidden by being distributed across many different streams, we add a sampling component to handle such cases. We provide a novel taxonomy of current sketches and perform a thorough analysis of the strengths and weaknesses of each scheme under various QoS metrics, using both real and synthetic Internet trace data. We summarize their performance and discuss the implications for the future design of sketches.


“…All protocols in the distributed streaming model are also valid protocols in our one-shot computational model, while our impossibility results in our one-shot computational model also apply to all protocols in the distributed streaming model. Example functions studied in the distributed streaming model include F_0 [7], F_2 (size of self join) [7,27], quantile and heavy-hitters [16], and the empirical entropy [3]. All of these problems have much lower communication cost if one allows an approximation of the output number x in a range [(1 − ε)x, (1 + ε)x], as mentioned above (the definition as to what ε is for the various problems differs).…”

Section: Related Work (mentioning)

confidence: 99%

“…In [7], a (1 + ε)-approximation algorithm (protocol) with O(k(log n + 1/ε^2 log 1/ε)) bits of communication was given in the distributed streaming model, which is also a protocol in the message-passing model. In a typical setting, we could have ε = 0.01, n = 10^9 and k = 1000, in which case the communication cost is about 6.6 × 10^7 bits.…”

Section: The Number Of Distinct Elements: a Case Study (mentioning)

confidence: 99%
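
The communication figure in the statement above can be checked with a few lines of arithmetic. The sketch below evaluates k(log n + (1/ε^2) log(1/ε)) for the quoted parameters. It assumes base-2 logarithms and a hidden constant of 1 in the O(·) bound; neither is stated in the excerpt, and the function name `f0_protocol_bits` is made up for this illustration.

```python
import math


def f0_protocol_bits(k: int, n: int, eps: float) -> float:
    """Evaluate k * (log2(n) + (1/eps^2) * log2(1/eps)).

    Assumptions (not given in the quoted text): logarithms are base 2
    and the constant hidden by the O() notation is taken to be 1.
    """
    per_site = math.log2(n) + (1.0 / eps**2) * math.log2(1.0 / eps)
    return k * per_site


# Parameters quoted in the citation statement: eps = 0.01, n = 10^9, k = 1000.
bits = f0_protocol_bits(k=1000, n=10**9, eps=0.01)
print(f"{bits:.3g} bits")
```

Under these assumptions the result falls between 6.6 × 10^7 and 6.7 × 10^7 bits, in line with the quoted estimate. The (1/ε^2) log(1/ε) term dominates; the log n term contributes only about 30 bits per site.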

When Distributed Computation Is Communication Expensive

Woodruff, Zhang 2013

Lecture Notes in Computer Science


We consider a number of fundamental statistical and graph problems in the message-passing model, where we have k machines (sites), each holding a piece of data, and the machines want to jointly solve a problem defined on the union of the k data sets. The communication is point-to-point, and the goal is to minimize the total communication among the k machines. This model captures all point-to-point distributed computational models with respect to minimizing communication costs. Our analysis shows that exact computation of many statistical and graph problems in this distributed setting requires a prohibitively large amount of communication, and often one cannot improve upon the communication of the simple protocol in which all machines send their data to a centralized server. Thus, in order to obtain protocols that are communication-efficient, one has to allow approximation, or investigate the distribution or layout of the data sets.
