Speeding Up Distributed Machine Learning Using Codes

Lee, Kangwook; Lam, Maximilian; Pedarsani, Ramtin; Papailiopoulos, Dimitris S.; Ramchandran, Kannan

doi:10.1109/tit.2017.2736066

Cited by 657 publications

(428 citation statements)

References 80 publications

Supporting

Mentioning

424

Contrasting


“…The shifted exponential model for computation time, which is the sum of a constant (deterministic) term and a variable (stochastic) term, is motivated by the distribution model proposed by authors in [28] for latency in querying data files from cloud storage systems. As demonstrated in [10] as well as by our own experiments, exponential model provides a good fit for the distribution of computation times over cloud computing environments such as Amazon EC2 clusters.…”

Section: B Network Model (supporting)

confidence: 53%
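The shifted exponential model in the statement above can be illustrated with a short simulation. This is a minimal sketch, not code from the cited papers: the function names and the `shift` and `rate` parameters are assumptions chosen for illustration. A computation time is drawn as a constant term plus an exponentially distributed term, so its mean should be close to `shift + 1 / rate`.

```python
import random


def sample_shifted_exponential(shift, rate, rng):
    """Draw one computation time: deterministic shift plus an Exp(rate) term."""
    # `shift` and `rate` are illustrative parameters, not values from the source.
    return shift + rng.expovariate(rate)


def mean_completion_time(shift, rate, trials, seed=0):
    """Empirical mean of `trials` independent shifted-exponential samples."""
    rng = random.Random(seed)
    total = sum(sample_shifted_exponential(shift, rate, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    # With shift = 1.0 and rate = 2.0 the theoretical mean is 1.0 + 1 / 2.0 = 1.5.
    print(mean_completion_time(shift=1.0, rate=2.0, trials=20000))
```

No sample can fall below `shift`, which is what distinguishes this model from a plain exponential.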

“…As we state in the following Theorem, HCMM provides an unbounded gain of Θ(log n) over uncoded scheme, in terms of expected running time. This result illustrates that leveraging coded computing, one achieves the same order-wise gain over heterogeneous clusters as over homogeneous clusters [10]. Theorem 2.…”

Section: Results (mentioning)

confidence: 68%

“…The master node can then obtain the final result from any k responses. In [10], the authors find the optimal k for minimizing the average running time.…”

Section: Problem Formulation (mentioning)

confidence: 99%
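The "any k responses" property in the statement above can be sketched with a small coded matrix-vector product. This is an illustrative example under stated assumptions, not the construction from the cited paper: it uses a Vandermonde code over exact rationals, and the names `encode`, `decode`, `solve` and `matvec` are hypothetical. The matrix is split into k row blocks, each of n workers multiplies one coded block by the vector, and the master recovers the full product from whichever k results arrive.

```python
from fractions import Fraction
from itertools import combinations


def matvec(rows, x):
    """Multiply a list of rows by the vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def encode(blocks, n):
    """Worker i (evaluation point i + 1) gets sum_j (i + 1)**j * blocks[j]."""
    coded = []
    for i in range(n):
        point = Fraction(i + 1)
        coded.append([[sum(point ** j * blocks[j][r][c] for j in range(len(blocks)))
                       for c in range(len(blocks[0][0]))]
                      for r in range(len(blocks[0]))])
    return coded


def solve(matrix, rhs):
    """Gauss-Jordan elimination over Fractions; each rhs row is a vector."""
    k = len(matrix)
    aug = [list(matrix[i]) + list(rhs[i]) for i in range(k)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = 1 / aug[col][col]
        aug[col] = [v * inv for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def decode(responses, k):
    """Recover A @ x from any k worker results (dict: worker index -> vector)."""
    workers = sorted(responses)[:k]
    vandermonde = [[Fraction(w + 1) ** j for j in range(k)] for w in workers]
    pieces = solve(vandermonde, [responses[w] for w in workers])
    return [value for piece in pieces for value in piece]


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6], [7, 8]]
    x = [2, 1]
    k, n = 2, 4
    blocks = [A[0:2], A[2:4]]
    results = [matvec(block, x) for block in encode(blocks, n)]
    # Every pair of workers is enough to reconstruct the uncoded product.
    for subset in combinations(range(n), k):
        assert decode({w: results[w] for w in subset}, k) == matvec(A, x)
```

Exact rational arithmetic keeps the sketch free of rounding concerns; a practical system would more likely use floating point or a finite field.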

“…This result implies that increasing the computation load by a factor of r (i.e., evaluating each computation at r carefully chosen nodes) can create novel coding opportunities that reduce the required communication load for computing by the same factor r. Hence, these codes can be utilized to pool the underutilized computing resources at network edge to slash the communication load of Fog computing [9]. Other related works tackling the communication bottleneck in distributed computation include [10]- [14].…”

Section: Introduction (mentioning)

confidence: 99%

“…In the second coding concept introduced in [10], an inverse-linear tradeoff between computation load and computation latency (i.e., the overall job response time) is established for distributed matrix multiplication in homogeneous computing environments. More specifically, this approach utilizes coding to effectively inject redundant computations to alleviate the effects of stragglers and speed up the computations.…”

Section: Introduction (mentioning)

confidence: 99%


Coded computation over heterogeneous clusters

Reisizadeh, Prakash, Pedarsani, et al. 2017

2017 IEEE International Symposium on Information Theory (ISIT)

Self Cite

108

156


In large-scale distributed computing clusters, such as Amazon EC2, there are several types of "system noise" that can result in major degradation of performance: system failures, bottlenecks due to limited communication bandwidth, latency due to straggler nodes, etc. On the other hand, these systems enjoy an abundance of redundancy: a vast number of computing nodes and large storage capacity. There have been recent results that demonstrate the impact of coding for efficient utilization of computation and storage redundancy to alleviate the effect of stragglers and communication bottlenecks in homogeneous clusters. In this paper, we focus on general heterogeneous distributed computing clusters consisting of a variety of computing machines with different capabilities. We propose a coding framework for speeding up distributed computing in heterogeneous clusters by trading redundancy for reducing the latency of computation. In particular, we propose the Heterogeneous Coded Matrix Multiplication (HCMM) algorithm for performing distributed matrix multiplication over heterogeneous clusters that is provably asymptotically optimal for a broad class of processing time distributions. Moreover, we show that HCMM is unboundedly faster than uncoded schemes that partition the total workload among the workers. To demonstrate how the proposed HCMM scheme can be applied in practice, we provide numerical results demonstrating significant speedups of up to 90% and 35% for HCMM in comparison to the "uncoded" and "coded homogeneous" schemes, respectively. Furthermore, we carry out real experiments over Amazon EC2 clusters that corroborate our numerical studies, where HCMM is found to be up to 17% faster than the uncoded scheme. Additionally, our observation is that machines rarely become stragglers and, when they do, they continue to exhibit slower performance for some time. In our worst-case experiments with artificial stragglers, HCMM provides speedups of up to 12× over the uncoded scheme.
Furthermore, we provide a generalization of the problem of optimal load allocation for heterogeneous clusters to scenarios with budget constraints and develop a heuristic algorithm for efficient load allocation. Finally, we discuss the decoding complexity and describe how LDPC codes can be combined with HCMM in order to control the complexity of decoding as the problem size increases.




Composite optimization with coupling constraints via dual proximal gradient method with applications to asynchronous networks

Wang 2022

Intl J Robust & Nonlinear


In this article, we consider solving a composite optimization problem with affine coupling constraints in a multi-agent network based on the proximal gradient method. In this problem, all the agents jointly minimize the sum of individual cost functions composed of smooth and possibly non-smooth parts. To this end, we derive the dual problem using the Fenchel conjugate, which gives rise to the dual proximal gradient (DPG) algorithm by allowing for asymmetric individual interpretations of the coupling constraints. Then, an asynchronous DPG (Asyn-DPG) algorithm is proposed for asynchronous networks with heterogeneous step-sizes and communication delays. For both algorithms, if the non-smooth parts of the objective functions are simple-structured, we only need to update dual variables by some simple operations, which reduces the overall computational complexity. The analytical convergence rate of the proposed algorithms is derived, and their efficacy is verified in numerical simulation by solving a social welfare optimization problem in an electricity market.


Artificial Intelligence and Machine Learning for Large-Scale Data

Phu, Tran 2018

Computational Intelligence and Sustainable Systems

