2019 IEEE International Symposium on Information Theory (ISIT)
DOI: 10.1109/isit.2019.8849317

Coded Matrix Multiplication on a Group-Based Model

Abstract: Coded distributed computing has been considered a promising technique for making large-scale systems robust to "straggler" workers. Yet, practical system models for distributed computing that reflect the clustered or grouped structure of real-world computing servers have not been available. Nor have the large variations in computing power and bandwidth capability across different servers been properly modeled. We suggest a group-based model to reflect practical conditions and develop an appro…

Cited by 29 publications (21 citation statements)
References 20 publications
“…While most CDC schemes consider homogeneous computing nodes, a few recent studies have investigated CDC over heterogeneous computing clusters. In particular, Kim et al. [32], [33] considered the matrix-vector multiplication problem and presented an optimal load-allocation method that achieves a lower bound on the expected latency. Reisizadeh et al. [21] introduced a different approach, Heterogeneous Coded Matrix Multiplication (HCMM), which maximizes the expected amount of computation aggregated at the master node.…”
Section: Related Workmentioning
confidence: 99%
“…Since the following equality holds, the PS can obtain the full gradient by receiving the computation results from all the workers. In contrast to this naive approach, coded computation schemes for distributed matrix multiplication [22, 23, 32, 34] first encode the submatrices and then assign them to the workers to achieve a certain tolerance against slow/straggling workers.…”
Section: An Overview Of Existing Straggler Avoidance Techniquesmentioning
confidence: 99%
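To make the encode-then-assign idea in the excerpt above concrete, here is a minimal sketch of (n, k) MDS-coded matrix-vector multiplication: the master splits the matrix into k row blocks, sends one coded block to each of n workers, and recovers the full product from the first k responses. The worker counts, the Vandermonde encoder, and the simulated straggler set are illustrative assumptions for this sketch, not the construction used in any of the cited papers.

```python
# Sketch of (n, k) MDS-coded matrix-vector multiplication with straggler tolerance.
# Assumption: a Vandermonde generator over distinct real evaluation points, so any
# k of the n coded results suffice to decode.
import numpy as np

n, k = 5, 3                       # n workers; any k responses are enough
rows, cols = 6, 4                 # rows must be divisible by k
rng = np.random.default_rng(0)
A = rng.standard_normal((rows, cols))
x = rng.standard_normal(cols)

# Split A into k row blocks and encode them into n coded blocks.
blocks = np.split(A, k)                                    # each block: (rows//k, cols)
G = np.vander(np.arange(1, n + 1), k, increasing=True)     # (n, k) Vandermonde encoder
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Each worker i computes coded[i] @ x; suppose workers 1 and 4 straggle.
finished = [0, 2, 3]                                       # indices of the first k responders
results = [coded[i] @ x for i in finished]

# Decode: invert the k x k submatrix of G formed by the responders' rows.
G_sub = G[finished, :]
decoded_blocks = np.linalg.solve(G_sub, np.stack(results)) # rows are blocks[j] @ x
y_hat = decoded_blocks.reshape(-1)

assert np.allclose(y_hat, A @ x)                           # full product recovered
```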
“…A wealth of straggler avoidance techniques have been proposed in recent years for DGD as well as other distributed computation tasks [5]–[48]. The common design notion behind all these schemes is the assignment of redundant computations/tasks to workers, such that faster workers can compensate for the stragglers.…”
Section: Introductionmentioning
confidence: 99%
“…In practical distributed computing systems, some processing nodes have the same computational capabilities, in the sense of sharing the same distribution of computation time, and thus they can be grouped together. By exploiting the group structure and the heterogeneity among different groups of processing nodes [141], [142], combining group codes with an optimal load-allocation strategy not only approaches the optimal computation time achieved by MDS codes but also has low decoding complexity. In addition, by varying the number of matrix rows allocated to the workers [142], the computation latency can be reduced by orders of magnitude, as the number of workers increases, compared to MDS codes with a fixed computation-load allocation [141].…”
Section: A Computation Load Allocationmentioning
confidence: 99%
“…By exploiting the group structure and the heterogeneity among different groups of processing nodes [141], [142], combining group codes with an optimal load-allocation strategy not only approaches the optimal computation time achieved by MDS codes but also has low decoding complexity. In addition, by varying the number of matrix rows allocated to the workers [142], the computation latency can be reduced by orders of magnitude, as the number of workers increases, compared to MDS codes with a fixed computation-load allocation [141]. The load-allocation strategy proposed in [142] focuses mainly on the design of an optimal MDS code.…”
Section: A Computation Load Allocationmentioning
confidence: 99%
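The two excerpts above describe allocating different numbers of coded rows to workers depending on their group. The following is a minimal sketch of such a group-aware allocation, assuming rows are assigned in proportion to each group's aggregate speed and inflated by a redundancy factor; the function name, the proportional rule, and the redundancy parameter are illustrative assumptions, not the allocation rule of [141] or [142].

```python
# Sketch of group-based computation-load allocation: workers in the same group
# share a computation-time distribution, so each group receives coded rows in
# proportion to its aggregate speed, with extra redundancy to tolerate stragglers.
from typing import Dict, List

def allocate_rows(total_rows: int,
                  groups: Dict[str, Dict[str, float]],
                  redundancy: float = 1.25) -> Dict[str, List[int]]:
    """Return, per group, the number of coded rows given to each of its workers.

    groups maps a group name to {"workers": count, "speed": rows per unit time}.
    redundancy (> 1) controls how much extra coded work is spread across the
    cluster so that slow workers can be ignored.
    """
    coded_rows = int(total_rows * redundancy)
    # Group capacity = number of workers times per-worker speed.
    capacity = {g: cfg["workers"] * cfg["speed"] for g, cfg in groups.items()}
    total_capacity = sum(capacity.values())

    allocation: Dict[str, List[int]] = {}
    for g, cfg in groups.items():
        group_share = coded_rows * capacity[g] / total_capacity
        per_worker = int(round(group_share / cfg["workers"]))
        allocation[g] = [per_worker] * int(cfg["workers"])
    return allocation

# Example: a fast group of 4 workers and a slow group of 8 workers.
print(allocate_rows(12000, {
    "fast": {"workers": 4, "speed": 100.0},
    "slow": {"workers": 8, "speed": 25.0},
}))
# -> fast workers each get 2500 coded rows, slow workers each get 625
```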