2019
DOI: 10.48550/arxiv.1901.09671
Preprint

ErasureHead: Distributed Gradient Descent without Delays Using Approximate Gradient Coding

Abstract: We present ErasureHead, a new approach for distributed gradient descent (GD) that mitigates system delays by employing approximate gradient coding. Gradient coded distributed GD uses redundancy to exactly recover the gradient at each iteration from a subset of compute nodes. ErasureHead instead uses approximate gradient codes to recover an inexact gradient at each iteration, but with higher delay tolerance. Unlike prior work on gradient coding, we provide a performance analysis that combines both delay and convergence guarantees…
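
To make the idea in the abstract concrete, the sketch below illustrates GD with approximate gradient coding: data partitions are replicated across workers, and each iteration uses an inexact gradient assembled from whichever partitions the fastest workers return. This is an illustration of the general scheme only, not the authors' exact construction; names such as `wait_fraction` and the simulated delay model are hypothetical.

```python
# Illustrative sketch (not the paper's exact method): approximate gradient-coded
# GD for least squares. Each partition is replicated on s workers; at every
# iteration the master keeps only the partial gradients that arrive before a
# cutoff and averages one copy per recovered partition.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, s = 600, 10, 12, 2          # samples, features, partitions, replication
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

parts = np.array_split(np.arange(n), k)                      # k data partitions
# worker j is assigned s partitions (cyclic placement, a common choice)
assignment = [[(j + t) % k for t in range(s)] for j in range(k)]

def partial_grad(w, idx):
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi) / len(idx)

w = np.zeros(d)
lr, iters, wait_fraction = 0.1, 50, 0.75                     # wait for 75% of workers
for _ in range(iters):
    delays = rng.exponential(size=k)                         # simulated per-worker delays
    fast = np.argsort(delays)[: int(wait_fraction * k)]
    recovered = {}                                           # one partial gradient per partition
    for j in fast:
        for p in assignment[j]:
            if p not in recovered:
                recovered[p] = partial_grad(w, parts[p])
    # inexact gradient: average over whichever partitions were recovered
    g = np.mean(list(recovered.values()), axis=0)
    w -= lr * g

print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))
```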

Cited by 19 publications (30 citation statements)
References 25 publications
“…This work has been extended in several directions. In [16], the authors developed algorithms to leverage partial computations at the stragglers; the communication and computation properties of GC were studied in [17], [18] and [19]; while GC was extended to distributed SGD in [20] and [21].…”
Section: PS
Citation type: mentioning
confidence: 99%
“…Coded computing methods with code rate r (a quantity between 0 and 1) make it possible to either recover the gradient exactly (e.g., [4]) or an approximation thereof (e.g., [5], [6], [21], [22]) from intermediate results computed by a subset of the workers, at the expense of increasing the computational load of each worker by a factor 1/r relative to GD. The gradient is recovered via a decoding operation (that typically reduces to solving a system of linear equations), the complexity of which usually increases superlinearly with the number of workers.…”
Section: Coded Computing
Citation type: mentioning
confidence: 99%
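
As a concrete illustration of the decoding step described in the statement above, the following sketch shows exact gradient recovery with a simple fractional-repetition code: each worker sends one linear combination of its partition gradients (raising its computational load by a factor 1/r = s), and the master recovers the full gradient by solving a small linear system over the non-straggling workers' rows. This is a generic example in the spirit of the cited exact-recovery schemes, not the decoder of any particular paper.

```python
# Minimal exact gradient coding sketch with a fractional-repetition code.
# Row j of the encoding matrix B is the combination computed by worker j;
# the master solves a^T B_alive = 1^T to recover the sum of all partition
# gradients from any set of non-straggling workers.
import numpy as np

k = 4                                  # partitions == workers here
s = 2                                  # each partition replicated s times -> rate r = 1/s
B = np.array([[1., 1., 0., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])

rng = np.random.default_rng(1)
partition_grads = rng.normal(size=(k, 3))      # toy per-partition gradients (dim 3)
worker_msgs = B @ partition_grads              # what each worker would send

# suppose worker 3 straggles; decode from workers {0, 1, 2}
alive = [0, 1, 2]
# find coefficients a with a^T B_alive = all-ones (least-squares solve)
a, *_ = np.linalg.lstsq(B[alive].T, np.ones(k), rcond=None)
decoded = a @ worker_msgs[alive]

print(np.allclose(decoded, partition_grads.sum(axis=0)))   # True: exact recovery
```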
“…Finally, for both PCA and logistic regression, the straggler resiliency afforded by coding is canceled out by the higher computational load. Here, we consider a code rate r = 45/49, which we find yields lower latency compared to the lower rates typically used in coded computing (e.g., in [4], [5], [6], [21], [22]).…”
Section: Artificial Scenario
Citation type: mentioning
confidence: 99%
“…In this paper, we focus on distributed machine learning setup, where the aim is to implement the iterative gradient descent algorithm. Coding techniques used in this setup are termed as gradient coding [2]- [9]. In gradient coding, the key idea is to create data partitions with coded redundancy such that they are robust to stragglers.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
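
The redundancy/straggler trade-off in the statement above can be checked directly: with a cyclic placement in which every partition lives on s workers, any s-1 stragglers can be tolerated, while some set of s stragglers cannot. The sketch below verifies this for a small configuration; the function and variable names are illustrative only.

```python
# Hedged sketch of the coded-redundancy idea: verify that a cyclic placement
# with replication s survives any s-1 stragglers but not every set of s.
from itertools import combinations

def cyclic_placement(num_workers: int, s: int):
    """Worker j holds partitions {j, j+1, ..., j+s-1} (mod num_workers)."""
    return [{(j + t) % num_workers for t in range(s)} for j in range(num_workers)]

def covers_all_partitions(placement, stragglers):
    survivors = [p for j, p in enumerate(placement) if j not in stragglers]
    return set().union(*survivors) == set(range(len(placement)))

n, s = 6, 3
placement = cyclic_placement(n, s)
# every choice of s-1 stragglers is tolerated...
print(all(covers_all_partitions(placement, set(c))
          for c in combinations(range(n), s - 1)))        # True
# ...but some choice of s stragglers is not
print(all(covers_all_partitions(placement, set(c))
          for c in combinations(range(n), s)))            # False
```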