Summary
Repairing multiple failures in distributed storage systems poses a challenge for erasure coding: how to minimize repair time while incurring the least extra repair network traffic. Existing repair schemes designed for a single failure incur high network traffic cost because they repair multiple failures serially, whereas repair schemes designed for multiple failures suffer from long repair times because of their centralized repair structure. In this paper, we propose a decentralized adaptive repair scheme, called DARS, that minimizes the repair time with the least extra network traffic cost. Specifically, we propose a three‐layer repair model that supports repairs of both single and multiple failures. To reduce repair time, a bandwidth‐aware node selection technique guides the selection of nodes, and a line‐structured data transmission technique organizes the data transmission between the providers and the newcomer. To keep the extra network traffic cost to a minimum, a core‐based data distribution technique organizes the data transmission between the coordinator and the other newcomers, and an intersection provider adjustment technique adaptively adjusts the number of intersection providers. Moreover, we adopt ‘lazy repair’ within a stripe to further reduce the repair network traffic cost. We implement and evaluate DARS on our raid distributed storage system under various parameter settings with 30 physical machines and 200 virtual machines. Experimental results confirm that DARS reduces the repair time by 29% and 55% on average compared with tree‐structured repair and CORE, respectively. Copyright © 2015 John Wiley & Sons, Ltd.