ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682911

Computation Scheduling for Distributed Machine Learning with Straggling Workers

Abstract: We study scheduling of computation tasks across n workers in a large-scale distributed learning problem with the help of a master. Computation and communication delays are assumed to be random, and redundant computations are assigned to workers in order to tolerate stragglers. We consider sequential computation of tasks assigned to a worker, while the result of each computation is sent to the master right after its completion. Each computation round, which can model an iteration of the stochastic gradient descent…

Cited by 30 publications (19 citation statements); references 41 publications.
“…Recent literature (including our work) [12], [1], [21], [2], [3], [4], [22], [23], [24] proposes methods to exploit the work completed by stragglers, rather than ignoring it. The underlying idea is to assign each worker a sequence of multiple small subtasks rather than a single large task.…”
Section: A Background: Stragglers and Coded Computing
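The idea quoted above — assigning each worker an ordered sequence of small subtasks, with each finished result sent to the master immediately, so that a straggler's partial work is still collected — can be sketched as a toy simulation. All numbers (per-subtask times, deadline) are hypothetical illustrations, not figures from the cited papers.

```python
def run_round(speeds, n_subtasks, deadline):
    """Toy sketch: each worker executes its ordered list of small subtasks
    sequentially, reporting every finished result to the master right away,
    so a straggler's partial work is collected instead of discarded."""
    completed = []
    for worker, per_subtask_time in enumerate(speeds):
        t = 0.0
        for subtask in range(n_subtasks):
            t += per_subtask_time
            if t > deadline:
                break  # this worker straggles past the deadline; keep what it finished
            completed.append((worker, subtask, t))
    return completed

# Worker 3 is a straggler (3.0 s per subtask) but still contributes one result.
results = run_round(speeds=[0.8, 1.0, 1.5, 3.0], n_subtasks=3, deadline=4.0)
print(len(results))  # 9 subtask results collected before the deadline
```

Under the single-large-task baseline, workers 2 and 3 would simply miss the deadline and contribute nothing; here they still deliver 2 and 1 subtask results, respectively.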
“…We extended special cases of hierarchical coding in [3], [4]. More recent works that aim to exploit stragglers, including [23], [24], [22], complement the idea presented in [21] and the idea we present in [1]. In [22], each worker is tasked with a specified fraction of coded and uncoded computations.…”
Section: A Background: Stragglers and Coded Computing
“…where η_t denotes the learning rate at iteration t, and shares the result with the devices for the computations at the following iterations. Although parallelism reduces the computation load at each device, communication from the devices to the PS becomes the main performance bottleneck [1]-[5], particularly for wireless edge learning due to limited bandwidth and power. Several architectures have been proposed in recent years to employ the computational capabilities of edge devices and train an ML model collaboratively with the help of a remote PS.…”
Section: Introduction
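The parameter-server (PS) step referenced in the quote above — apply the gradients returned by the devices with learning rate η_t, then broadcast the new model — can be sketched as follows. The function name and the averaging-based aggregation are our illustrative assumptions, not the specific scheme of the citing paper.

```python
import numpy as np

def ps_update(theta, worker_gradients, eta_t):
    """Sketch of one PS iteration (hypothetical helper): average the
    gradients returned by the devices and take one SGD step,
    theta_{t+1} = theta_t - eta_t * g, then broadcast theta_{t+1}."""
    g = np.mean(worker_gradients, axis=0)  # aggregate the device gradients
    return theta - eta_t * g               # model shared for the next iteration

theta = np.zeros(3)
grads = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0])]
theta = ps_update(theta, grads, eta_t=0.1)
print(theta)  # [-0.2 -0.2 -0.2]
```

Each round of this exchange is exactly where the uplink communication bottleneck the quote describes arises: every device must ship its gradient to the PS before the step can be taken.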
“…Since task completion time depends on the slowest worker, a key bottleneck in distributed computing is the straggler effect: experiments on Amazon EC2 instances show that some workers can be 5 times slower than the typical performance [3]. This straggler effect can be mitigated by adding redundancy to the distributed computing system via coding [2]-[8], or by scheduling computation tasks [9]-[11]. Maximum distance separable (MDS) codes are widely applied for matrix multiplications [2]-[7], which can reduce the task completion time by O(log N), where N is the number of workers [2].…”
Section: Introduction
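The MDS-coded matrix multiplication mentioned in the quote above can be sketched with a real-valued Vandermonde encoder: split A into k row blocks, encode them into n coded blocks, and recover A @ x from the results of any k of the n workers. This is an illustrative (n, k) construction under our own assumptions, not the exact code of any specific cited paper.

```python
import numpy as np

# (n, k) MDS-coded matrix-vector multiplication sketch: any k of the
# n worker results suffice to recover A @ x, so n - k stragglers are tolerated.
n, k = 5, 3                                   # 5 workers, any 3 suffice
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))               # 6 rows, split into k = 3 blocks
x = rng.standard_normal(4)

blocks = np.split(A, k)                       # k row blocks of A
G = np.vander(np.arange(1, n + 1), k, increasing=True)  # n x k Vandermonde encoder
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Each worker i computes coded[i] @ x; suppose only workers {0, 2, 4} finish.
survivors = [0, 2, 4]
partial = np.array([coded[i] @ x for i in survivors])
decode = np.linalg.inv(G[survivors])          # any k rows of G are invertible
recovered = np.concatenate(decode @ partial)  # stack the decoded block products

assert np.allclose(recovered, A @ x)          # full product despite 2 stragglers
```

Because any k rows of a Vandermonde matrix with distinct nodes form an invertible k x k matrix, the master can decode as soon as the fastest k workers respond, which is the mechanism behind the O(log N) completion-time gain the quote attributes to [2].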