2017 · Preprint
DOI: 10.48550/arxiv.1711.06771
Approximate Gradient Coding via Sparse Random Graphs

Abstract: Distributed algorithms are often beset by the straggler effect, where the slowest compute nodes in the system dictate the overall running time. Coding-theoretic techniques have been recently proposed to mitigate stragglers via algorithmic redundancy. Prior work in coded computation and gradient coding has mainly focused on exact recovery of the desired output. However, slightly inexact solutions can be acceptable in applications that are robust to noise, such as model training via gradient-based algorithms. In…
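To make the setting concrete, here is a toy NumPy simulation of the idea the abstract sketches: data partitions are assigned to workers through a sparse random bipartite graph, and the master approximately recovers the average gradient from whichever workers respond. The uniform-averaging decoder and all names below are our illustration, not the paper's actual construction or guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 20, 5, 0.2   # n workers/partitions, gradient dimension d, edge prob p

# Sparse random assignment graph: worker i computes the partial gradients of
# every partition j with G[i, j] = 1.
G = (rng.random((n, n)) < p).astype(float)
np.fill_diagonal(G, 1.0)              # each worker at least holds its own partition

partials = rng.normal(size=(n, d))    # stand-in for the n partial gradients
true_grad = partials.mean(axis=0)     # exact full gradient (average of partials)

coded = G @ partials                  # each worker returns the sum of its partials

alive = rng.random(n) < 0.7           # stragglers: only some workers respond

# Naive approximate decoding: average the surviving coded messages and
# normalize by the average number of partitions per worker. This is unbiased
# for a uniform random graph, but it is not the optimal decoder of the paper.
row_weight = G.sum() / n
est_grad = coded[alive].mean(axis=0) / row_weight

print(np.linalg.norm(est_grad - true_grad))   # small but nonzero: approximate
```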

Cited by 27 publications (93 citation statements: 2 supporting, 91 mentioning, 0 contrasting). References 14 publications. Citing publications appeared between 2018 and 2021.
“…We provide a thorough analysis for these two schemes under shuffling and obtain expressions for their expected optimal decoding error. Our analysis for the FRC scheme extends the existing results [8, Thm. 6] for the heterogeneous straggler model…”
Section: Introduction (supporting)
confidence: 84%
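For context, here is a minimal construction of the FRC (fractional repetition coding) assignment matrix mentioned in this excerpt, following the block-replication scheme of Tandon et al.; the function name and parameter choices are ours, and the cited analysis under shuffling is not reproduced here.

```python
import numpy as np

def frc_matrix(n, s):
    """Fractional repetition coding assignment for n workers tolerating s
    stragglers: workers are split into n // (s + 1) groups, and every worker
    in a group is assigned the same block of s + 1 partitions, so any single
    survivor per group recovers that block's gradient sum.
    Assumes (s + 1) divides n."""
    assert n % (s + 1) == 0
    g = s + 1
    B = np.zeros((n, n))
    for k in range(n // g):
        B[k * g:(k + 1) * g, k * g:(k + 1) * g] = 1.0
    return B

print(frc_matrix(6, 2))   # 6 workers, s = 2: two groups of three workers
```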
“…Let A ∈ R^(n×r) be the submatrix formed by the columns of B that correspond to non-stragglers. This matrix A is termed the non-straggler matrix [8]. For an (s − 1)-tolerant coding scheme, the master is guaranteed to compute the exact gradient g when r ≥ n − s + 1 [2].…”
Section: System Model (mentioning)
confidence: 99%
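A small sketch of how the non-straggler matrix A from this excerpt is assembled, together with the exact-recovery condition r ≥ n − s + 1; the placeholder coding matrix B and the helper name are illustrative only.

```python
import numpy as np

def non_straggler_matrix(B, responders):
    """Collect the columns of the coding matrix B that correspond to the
    workers that actually responded (the non-stragglers)."""
    return B[:, responders]

n, s = 6, 2                      # 6 workers, scheme tolerating s - 1 stragglers
B = np.ones((n, n))              # placeholder coding matrix (illustrative only)
responders = [0, 2, 3, 4, 5]     # worker 1 straggled, so r = 5

A = non_straggler_matrix(B, responders)
r = A.shape[1]
# Per the excerpt, exact recovery of the gradient is guaranteed once
# r >= n - s + 1 (here 5 >= 5).
print(r >= n - s + 1)
```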
“…One of the applications of coding theory is to enable error correction [50]. Gradient coding was originally proposed as a straggler mitigation method [106] and is used to speed up synchronous distributed first-order methods [17, 23, 89]. Several works build on it and extend it to the adversarial setting.…”
Section: Gradient Coding (mentioning)
confidence: 99%
“…The major difference between Algorithm 2 and DGD is (17): instead of updating the estimates with the sum of all agents' updates, BGD uses an aggregation rule GradFilter(·). Generally speaking, GradFilter: R^(d×n) → R^d is a function that takes n d-dimensional vectors and outputs a single d-dimensional vector.…”
(mentioning)
confidence: 99%
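To illustrate the GradFilter signature described in this excerpt, the sketch below implements one standard choice of robust aggregation rule, the coordinate-wise trimmed mean; the cited work may use a different filter, and the function name is ours.

```python
import numpy as np

def trimmed_mean_filter(grads, f):
    """One possible GradFilter: R^(d x n) -> R^d. Takes the n d-dimensional
    gradients as columns, drops the f smallest and f largest entries in each
    coordinate, and averages what remains."""
    d, n = grads.shape
    assert n > 2 * f, "need more gradients than trimmed entries"
    s = np.sort(grads, axis=1)        # sort each coordinate across the n agents
    return s[:, f:n - f].mean(axis=1)

# Example: 5 agents, 3-dimensional gradients, trim one outlier on each side.
g = np.random.default_rng(1).normal(size=(3, 5))
print(trimmed_mean_filter(g, 1))
```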