Erasure Coding for Distributed Storage: An Overview

Balaji, Sai A.; Krishnan, M. Nikhil; Vajha, Myna; Ramkumar, Vinayak; Sasidharan, Birenjith; Kumar, P. Vijay

doi:10.48550/arxiv.1806.04437

Cited by 3 publications

(6 citation statements)

References 111 publications


“…, s − 1} we download one symbol of F from each of the d helper racks. There are $s^{n-1}$ subsets of the above form, and thus the total repair bandwidth is $d s^{n-1} = dl/s$, proving the optimality claim of the code according to (2).…”

Section: Rack-aware Codes With Optimal Repair For All Parameters (mentioning)

confidence: 85%

“…We will show that the code defined in (9) is an MDS code that has the smallest possible repair bandwidth according to the bound (2). Before stating the main theorem that proves these claims let us comment on the origin as well as the new elements in this construction.…”

Section: Rack-aware Codes With Optimal Repair For All Parameters (mentioning)

confidence: 92%

“…The problems of centralized and cooperative repair have been addressed in multiple recent papers, and there are explicit constructions of optimal-repair regenerating codes that cover the entire range of admissible parameters, require small-size ground alphabet compared to the length n of the encoding block, and attain the smallest possible repair bandwidth [16], [23], [28], [19], [27], [30], [13] (more references are given in a recent survey [2]). The availability of optimal constructions has motivated a shift of attention toward studying data recovery not only under communication, but also connectivity constraints, in other words, storage models in which communication cost between nodes differs depending on their location in the storage cluster.…”

Section: Introduction (mentioning)

confidence: 99%
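
To make "smallest possible repair bandwidth" concrete, here is a minimal sketch for the classical homogeneous setting (no racks). It compares the naive repair of an (n, k) MDS array code, which downloads k whole nodes, with the well-known cut-set bound d·l/(d − k + 1) for repair from d helper nodes, where l is the number of symbols stored per node. The function names and the parameters n = 14, k = 10, d = 13, l = 256 are illustrative assumptions and do not come from the cited papers; the rack-aware bound discussed in the statements above is a different quantity and is not computed here.

```python
from fractions import Fraction


def naive_repair_bandwidth(k: int, l: int) -> int:
    """Symbols downloaded when a failed node is rebuilt from k whole nodes."""
    return k * l


def msr_repair_bandwidth(n: int, k: int, d: int, l: int) -> Fraction:
    """Cut-set lower bound d*l/(d-k+1) on single-node repair bandwidth.

    Assumes the homogeneous model: d helper nodes, k <= d <= n - 1,
    and l symbols stored per node (hypothetical parameter names).
    """
    if not (k <= d <= n - 1):
        raise ValueError("need k <= d <= n - 1")
    return Fraction(d * l, d - k + 1)


if __name__ == "__main__":
    n, k, d, l = 14, 10, 13, 256  # illustrative values only
    print("naive:", naive_repair_bandwidth(k, l))
    print("cut-set bound:", msr_repair_bandwidth(n, k, d, l))
```

With d = k the bound equals the naive cost k·l, so the savings come entirely from contacting more than k helpers and downloading a fraction of each helper's contents.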


Explicit Constructions of MSR Codes for Clustered Distributed Storage: The Rack-Aware Storage Model

Chen

Barg

2020

IEEE Trans. Inform. Theory


The paper is devoted to the problem of erasure coding in distributed storage. We consider a model of storage that assumes that nodes are organized into equally sized groups, called racks, that within each group the nodes can communicate freely without taxing the system bandwidth, and that the only information transmission that counts is the one between the racks. This assumption implies that the nodes within each of the racks can collaborate before providing information to the failed node. The main emphasis of the paper is on code construction for this storage model. We present an explicit family of MDS array codes that support recovery of a single failed node from any number of helper racks using the minimum possible amount of inter-rack communication (such codes are said to provide optimal repair). The codes are constructed over finite fields of size comparable to the code length. We also derive a bound on the number of symbols accessed at helper nodes for the purposes of repair, and construct a code family that approaches this bound, while still maintaining the optimal repair property. Finally, we present a construction of scalar Reed-Solomon codes that support optimal repair for the rack-oriented storage model.

The problems of centralized and cooperative repair have been addressed in multiple recent papers, and there are explicit constructions of optimal-repair regenerating codes that cover the entire range of admissible parameters, require small-size ground alphabet compared to the length n of the encoding block, and attain the smallest possible repair bandwidth [16], [23], [28], [19], [27], [30], [13] (more references are given in a recent survey [2]). The availability of optimal constructions has motivated a shift of attention toward studying data recovery not only under communication, but also connectivity constraints, in other words, storage models in which communication cost between nodes differs depending on their location in the storage cluster.

One of the simple extensions from the basic setting of homogeneous storage suggests that the nodes are joined into several groups (clusters), and repair of a node can be based on information from both the nodes within its own group and from nodes in the other groups. This makes it possible to differentiate between communication within the cluster and the inter-cluster downloads, and the natural assumption is that the former is easier (contributes less to the repair bandwidth) than the latter.

Erasure coding for clustered architectures was introduced several years ago and affords several variations. One of the first questions analyzed for heterogeneous storage models was related to repair under the condition that the system contains a group of nodes, downloading information from which contributes more to the repair bandwidth than downloading the same amount of information from the other nodes [1]. Later works [6], [14] observed that a more realistic version of non-homogeneous storage should assume that the cost of downloading information depends on the relative location of th...


“…Distributed systems such as Hadoop, AT&T Cloud Storage, Google File System and Windows Azure have evolved to support different types of erasure codes, in order to achieve the benefits of improved storage efficiency while providing the same reliability as replication-based schemes (Balaji et al, 2018). Various erasure code plug-ins and libraries have been developed in storage systems like Ceph (Weil et al, 2006; Aggarwal et al, 2017a), Tahoe (Xiang et al, 2016), Quantcast (QFS) (Ovsiannikov et al, 2013), and Hadoop (HDFS) (Rashmi et al, 2014).…”

Section: Erasure Coding In Distributed Storage (mentioning)

confidence: 99%

“…Distributed systems such as Hadoop, AT&T Cloud Storage, Google File System and Windows Azure have evolved to support different types of erasure codes, in order to achieve the benefits of improved storage efficiency while providing the same reliability as replication-based schemes (Balaji et al, 2018). In particular, Reed-Solomon (RS) codes have been implemented in the Azure production cluster and resulted in the savings of millions of dollars for Microsoft (Huang et al, 2012a; blog, 2012).…”

Section: Exemplary Implementation Of Erasure-coded Storage (mentioning)

confidence: 99%

Modeling and Optimization of Latency in Erasure-coded Storage Systems

Aggarwal

Lan

2020

Preprint


As consumers are increasingly engaged in social networking and E-commerce activities, businesses grow to rely on Big Data analytics for intelligence, and traditional IT infrastructures continue to migrate to the cloud and edge, these trends cause distributed data storage demand to rise at an unprecedented speed. Erasure coding has quickly emerged as a promising technique to reduce storage cost while providing reliability similar to that of replicated systems, and it has been widely adopted by companies like Facebook, Microsoft and Google. However, it also brings new challenges in characterizing and optimizing the access latency when erasure codes are used in distributed storage. The aim of this monograph is to provide a review of recent progress (both theoretical and practical) on systems that employ erasure codes for distributed storage.

In this monograph, we will first identify the key challenges and taxonomy of the research problems and then give an overview of different approaches that have been developed to quantify and model latency of erasure-coded storage. This includes recent work leveraging MDS-Reservation, Fork-Join, Probabilistic, and Delayed-Relaunch scheduling policies, as well as their applications to characterize access latency (e.g., mean, tail, asymptotic latency) of erasure-coded distributed storage systems. We will also extend the problem to the case when users are streaming videos from erasure-coded distributed storage systems. Next, we bridge the gap between theory and practice, and discuss lessons learned from prototype implementation. In particular, we will discuss exemplary implementations of erasure-coded storage, illuminate key design degrees of freedom and tradeoffs, and summarize remaining challenges in real-world storage systems such as in content delivery and caching. Open problems for future research are discussed at the end of each chapter.
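
The storage-cost claim can be illustrated with a small calculation. The sketch below assumes an (n, k) MDS code, which stores n coded blocks for every k data blocks and tolerates any n − k lost blocks, and compares it with plain replication. The function names and the sample parameters (3 copies, n = 14, k = 10) are hypothetical choices for illustration, not figures taken from the monograph.

```python
from fractions import Fraction


def replication_profile(copies: int) -> tuple[Fraction, int]:
    """Return (storage overhead, tolerated node failures) for plain replication."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return Fraction(copies), copies - 1


def mds_profile(n: int, k: int) -> tuple[Fraction, int]:
    """Return (storage overhead, tolerated node failures) for an (n, k) MDS code."""
    if not (1 <= k <= n):
        raise ValueError("need 1 <= k <= n")
    return Fraction(n, k), n - k


if __name__ == "__main__":
    for label, (overhead, failures) in {
        "3-way replication": replication_profile(3),
        "(14, 10) MDS code": mds_profile(14, 10),
    }.items():
        print(f"{label}: {float(overhead):.2f}x storage, survives {failures} failures")
```

Under these assumed parameters the coded scheme stores less than half as much raw data as triple replication while surviving more simultaneous failures; the price, as the abstract notes, is added complexity in access latency.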


Explicit constructions of MSR codes for clustered distributed storage: The rack-aware storage model

Chen

Barg

2019

Preprint

