ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682911

Computation Scheduling for Distributed Machine Learning with Straggling Workers

Abstract: We study scheduling of computation tasks across n workers in a large-scale distributed learning problem with the help of a master. Computation and communication delays are assumed to be random, and redundant computations are assigned to workers in order to tolerate stragglers. We consider sequential computation of tasks assigned to a worker, while the result of each computation is sent to the master right after its completion. Each computation round, which can model an iteration of the stochastic gradient descent…

Cited by 30 publications (19 citation statements); references 41 publications.
“…Recent literature (including our work) [12], [1], [21], [2], [3], [4], [22], [23], [24] proposes methods to exploit the work completed by stragglers, rather than ignoring it. The underlying idea is to assign each worker a sequence of multiple small subtasks rather than a single large task.…”
Section: A Background: Stragglers and Coded Computing
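The idea quoted above — assigning each worker an ordered sequence of small subtasks, with each finished result sent to the master immediately, so that a straggler's partial work is still collected — can be sketched as a toy simulation. All numbers (per-subtask times, deadline) are hypothetical illustrations, not figures from the cited papers.

```python
def run_round(speeds, n_subtasks, deadline):
    """Toy sketch: each worker executes its ordered list of small subtasks
    sequentially, reporting every finished result to the master right away,
    so a straggler's partial work is collected instead of discarded."""
    completed = []
    for worker, per_subtask_time in enumerate(speeds):
        t = 0.0
        for subtask in range(n_subtasks):
            t += per_subtask_time
            if t > deadline:
                break  # this worker straggles past the deadline; keep what it finished
            completed.append((worker, subtask, t))
    return completed

# Worker 3 is a straggler (3.0 s per subtask) but still contributes one result.
results = run_round(speeds=[0.8, 1.0, 1.5, 3.0], n_subtasks=3, deadline=4.0)
print(len(results))  # 9 subtask results collected before the deadline
```

Under the single-large-task baseline, workers 2 and 3 would simply miss the deadline and contribute nothing; here they still deliver 2 and 1 subtask results, respectively.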
“…We extended special cases of hierarchical coding in [3], [4]. More recent works that aim to exploit stragglers, including [23], [24], [22], complement the idea presented in [21] and the idea we present in [1]. In [22], each worker is tasked with a specified fraction of coded and uncoded computations.…”
Section: A Background: Stragglers and Coded Computing
“…where η_t denotes the learning rate at iteration t, and shares the result with the devices for the computations at the following iterations. Although parallelism reduces the computation load at each device, communication from the devices to the PS becomes the main performance bottleneck [1]-[5], particularly for wireless edge learning due to limited bandwidth and power. Several architectures have been proposed in recent years to employ the computational capabilities of edge devices and train an ML model collaboratively with the help of a remote PS.…”
Section: Introduction
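The parameter-server (PS) step referenced in the quote above — apply the gradients returned by the devices with learning rate η_t, then broadcast the new model — can be sketched as follows. The function name and the averaging-based aggregation are our illustrative assumptions, not the specific scheme of the citing paper.

```python
import numpy as np

def ps_update(theta, worker_gradients, eta_t):
    """Sketch of one PS iteration (hypothetical helper): average the
    gradients returned by the devices and take one SGD step,
    theta_{t+1} = theta_t - eta_t * g, then broadcast theta_{t+1}."""
    g = np.mean(worker_gradients, axis=0)  # aggregate the device gradients
    return theta - eta_t * g               # model shared for the next iteration

theta = np.zeros(3)
grads = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0])]
theta = ps_update(theta, grads, eta_t=0.1)
print(theta)  # [-0.2 -0.2 -0.2]
```

Each round of this exchange is exactly where the uplink communication bottleneck the quote describes arises: every device must ship its gradient to the PS before the step can be taken.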
“…Since task completion time depends on the slowest worker, a key bottleneck in distributed computing is the straggler effect: experiments on Amazon EC2 instances show that some workers can be 5 times slower than the typical performance [3]. This straggler effect can be mitigated by adding redundancy to the distributed computing system via coding [2]-[8], or by scheduling computation tasks [9]-[11]. Maximum distance separable (MDS) codes are widely applied for matrix multiplications [2]-[7], which can reduce the task completion time by O(log N), where N is the number of workers [2].…”
Section: Introduction
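The MDS-coded matrix multiplication mentioned in the quote above can be sketched with a real-valued Vandermonde encoder: split A into k row blocks, encode them into n coded blocks, and recover A @ x from the results of any k of the n workers. This is an illustrative (n, k) construction under our own assumptions, not the exact code of any specific cited paper.

```python
import numpy as np

# (n, k) MDS-coded matrix-vector multiplication sketch: any k of the
# n worker results suffice to recover A @ x, so n - k stragglers are tolerated.
n, k = 5, 3                                   # 5 workers, any 3 suffice
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))               # 6 rows, split into k = 3 blocks
x = rng.standard_normal(4)

blocks = np.split(A, k)                       # k row blocks of A
G = np.vander(np.arange(1, n + 1), k, increasing=True)  # n x k Vandermonde encoder
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Each worker i computes coded[i] @ x; suppose only workers {0, 2, 4} finish.
survivors = [0, 2, 4]
partial = np.array([coded[i] @ x for i in survivors])
decode = np.linalg.inv(G[survivors])          # any k rows of G are invertible
recovered = np.concatenate(decode @ partial)  # stack the decoded block products

assert np.allclose(recovered, A @ x)          # full product despite 2 stragglers
```

Because any k rows of a Vandermonde matrix with distinct nodes form an invertible k x k matrix, the master can decode as soon as the fastest k workers respond, which is the mechanism behind the O(log N) completion-time gain the quote attributes to [2].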