Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems there are several types of noise that can affect the performance of distributed machine learning algorithms, such as straggler nodes, system failures, or communication bottlenecks, but there has been little interaction cutting across codes, machine learning, and distributed systems. In this work, we provide theoretical insights on how coded solutions can achieve significant gains compared to uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers and show that if the number of homogeneous workers is n, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of log n. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction α of the data matrix can be cached at each worker, and n is the number of workers, coded shuffling reduces the communication cost by a factor of (α + 1/n)γ(n) compared to uncoded shuffling, where γ(n) is the ratio of the cost of unicasting n messages to n users to that of multicasting a common message (of the same size) to n users. For instance, γ(n) = n if multicasting a message to n users is as cheap as unicasting a message to one user. We also provide experimental results corroborating the theoretical gains of the coded algorithms.
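To make the straggler-mitigation idea concrete, here is a minimal sketch (Python with numpy, with names and parameters chosen for illustration; a toy instance of MDS-coded computation rather than the paper's exact construction) of a (3, 2)-coded matrix multiplication: the product A x is recoverable from any two of three worker results, so the slowest worker can simply be ignored.

# Minimal sketch: (3, 2) MDS-coded matrix multiplication.
# A x is recoverable from ANY two of the three worker results.
import numpy as np

def encode_tasks(A):
    """Split A row-wise into A1, A2 and add the redundant block A1 + A2."""
    A1, A2 = np.vsplit(A, 2)
    return [A1, A2, A1 + A2]          # tasks for workers 0, 1, 2

def decode(results):
    """Recover [A1 @ x; A2 @ x] from any two completed worker results.
    `results` maps worker index -> its partial product."""
    if 0 in results and 1 in results:
        return np.concatenate([results[0], results[1]])
    if 0 in results and 2 in results:          # A2 @ x = (A1 + A2) @ x - A1 @ x
        return np.concatenate([results[0], results[2] - results[0]])
    if 1 in results and 2 in results:          # A1 @ x = (A1 + A2) @ x - A2 @ x
        return np.concatenate([results[2] - results[1], results[1]])
    raise ValueError("need results from at least two workers")

# Usage: worker 1 straggles, yet A @ x is still recovered exactly.
rng = np.random.default_rng(0)
A, x = rng.standard_normal((4, 3)), rng.standard_normal(3)
tasks = encode_tasks(A)
results = {0: tasks[0] @ x, 2: tasks[2] @ x}   # worker 1's result never arrives
assert np.allclose(decode(results), A @ x)

Note that encoding and decoding here are plain additions and subtractions over floats, in line with the remark below that coding is performed over the representation field of the input data.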
Distributed storage systems for large-scale applications typically use replication for reliability. Recently, erasure codes have been used to reduce the large storage overhead while increasing data reliability. A main limitation of off-the-shelf erasure codes is their high repair cost during single-node failure events. A major open problem in this area has been the design of codes that i) are repair efficient and ii) achieve arbitrarily high data rates. In this paper, we explore the repair metric of locality, which corresponds to the number of disk accesses required during a single node repair. Under this metric we characterize an information-theoretic trade-off that binds together locality, code distance, and the storage capacity of each node. We show the existence of optimal locally repairable codes (LRCs) that achieve this trade-off. The achievability proof uses a locality-aware flowgraph gadget which leads to a randomized code construction. Finally, we present an optimal and explicit LRC that achieves arbitrarily high data rates. Our locality-optimal construction is based on simple combinations of Reed-Solomon blocks. Parts of this work were presented in [1].
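As a point of reference for the trade-off mentioned above, the scalar special case (the well-known bound of Gopalan et al., which the abstract's storage-aware trade-off generalizes) can be stated as follows, with n, k, d the code length, dimension, and minimum distance, and r the locality.

% Scalar locality bound: any (n, k) code in which every symbol can be
% repaired from at most r other symbols has minimum distance
d \le n - k - \left\lceil \tfrac{k}{r} \right\rceil + 2 .
% Setting r = k recovers the Singleton bound d \le n - k + 1.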
Fig. 1: Conceptual diagram of the phases of distributed computation. The algorithmic workflow of distributed (potentially iterative) tasks can be seen as receiving input data, storing it in distributed nodes, communicating data around the distributed network, and then computing locally a function at each distributed node. The main bottlenecks in this execution (communication, stragglers, system failures) can all be abstracted away by incorporating a notion of delays between these phases, denoted by ∆ boxes.
Our second result is a reduced communication cost for data shuffling done in parallel machine learning algorithms. We show that when a constant fraction of the data matrix can be cached at each worker, and n is the number of workers, coded shuffling reduces the communication cost by a factor of Θ(γ(n)) compared to uncoded shuffling, where γ(n) is the ratio of the cost of unicasting n messages to n users to that of multicasting a common message (of the same size) to n users. For instance, γ(n) = n if multicasting a message to n users is as cheap as unicasting a message to one user. We would like to remark that a major innovation of our coding solutions is that they are woven into the fabric of the algorithmic design, and coding/decoding is performed over the representation field of the input data (e.g., floats or doubles). In sharp contrast to most coding ap...
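A toy sketch of the coded-shuffling idea described above, assuming a two-worker setup with made-up data blocks (an index-coding-style illustration, not the paper's full scheme): each worker already caches the block the other one needs next, so the master multicasts a single XOR instead of sending two unicast messages.

# Toy sketch of coded shuffling with two workers and two cached blocks.
import numpy as np

block = {1: np.frombuffer(b"rows assigned to worker 1 next epoch", dtype=np.uint8),
         2: np.frombuffer(b"rows assigned to worker 2 next epoch", dtype=np.uint8)}

coded = block[1] ^ block[2]           # one multicast message serves both workers

recovered_by_w1 = coded ^ block[2]    # worker 1 cached block 2, wants block 1
recovered_by_w2 = coded ^ block[1]    # worker 2 cached block 1, wants block 2

assert bytes(recovered_by_w1) == bytes(block[1])
assert bytes(recovered_by_w2) == bytes(block[2])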
Petabyte-scale distributed storage systems are currently transitioning to erasure codes to achieve higher storage efficiency. Classical codes like Reed-Solomon are highly sub-optimal for distributed environments due to their high overhead during single-failure events. Locally Repairable Codes (LRCs) form a new family of codes that are repair efficient. In particular, LRCs minimize the number of nodes participating in single-node repairs, during which they generate small network traffic. Two large-scale distributed storage systems have already implemented different types of LRCs: Windows Azure Storage and the Hadoop Distributed File System RAID used by Facebook. The fundamental bounds for LRCs, namely the best possible distance for a given code locality, were recently discovered, but few explicit constructions exist. In this work, we present explicit and optimal LRCs that are simple to construct. Our construction is based on grouping Reed-Solomon (RS) coded symbols to obtain RS coded symbols over a larger finite field. We then partition these RS symbols into small groups, and re-encode them using a simple local code that offers low repair locality. For the analysis of the optimality of the code, we derive a new result on the matroid represented by the code's generator matrix.
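To illustrate the repair-locality benefit in the simplest possible terms, the following sketch (Python, with hypothetical toy parameters: single bytes stand in for RS coded symbols and one XOR parity per group stands in for the paper's local code) shows how a single lost symbol is rebuilt by reading only the r symbols in its own group rather than contacting k nodes.

# Simplified sketch of repair locality with one XOR parity per group.
from functools import reduce

r = 4                                               # locality: symbols read per repair
symbols = [bytes([17 * i]) for i in range(8)]       # stand-ins for RS coded symbols

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Partition into groups of size r and append one local parity per group.
groups = [symbols[i:i + r] for i in range(0, len(symbols), r)]
groups = [g + [reduce(xor, g)] for g in groups]

# Single-node repair: rebuild a lost symbol from the r survivors in its own
# group, instead of contacting k nodes as plain Reed-Solomon would require.
lost_group, lost_idx = 0, 2
survivors = [s for j, s in enumerate(groups[lost_group]) if j != lost_idx]
repaired = reduce(xor, survivors)
assert repaired == groups[lost_group][lost_idx]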