Evaluating impact of human errors on the availability of data storage systems

2020

IEEE Trans. Comput.

Self Cite

The emergence of Solid-State Drives (SSDs) has transformed the data storage industry, where they are rapidly replacing Hard Disk Drives (HDDs) due to their superior performance and power consumption. However, SSDs have reliability issues due to bit errors, bad blocks, and bad chips. To improve reliability, Redundant Array of Independent Disks (RAID) configurations, originally proposed to increase both the performance and reliability of HDDs, are also applied to SSD arrays. The conventional reliability models of HDD RAID, however, cannot be applied directly to SSD arrays, as the nature of failures in SSDs is entirely different from that in HDDs. Previous studies on the reliability of SSD arrays are based on outdated SSD failure data and focus on only a limited set of failure types, namely device failures and page failures caused by bit errors, while recent field studies have reported other failure types, including bad blocks and bad chips, and a high correlation between failures. In this paper, we investigate the reliability of SSD arrays using field storage traces and a real-system implementation of conventional and emerging erasure codes. The reliability is evaluated by statistical fault injection experiments that post-process the usage logs obtained from the real-system implementation, while the fault/failure attributes are obtained from state-of-the-art field data reported by previous works.
As a case study, we examine conventional RAID5 and RAID6 and emerging Partial-MDS (PMDS) codes, Sector-Disk (SD) codes, and STAIR codes in terms of both reliability and performance using an open-source software RAID controller, MD (in Linux kernel version 3.10.0-327), and arrays of Samsung 850 Pro SSDs. Our detailed analysis of the data loss breakdown shows that a) emerging erasure codes fail to replace RAID6 in terms of reliability, b) row-wise erasure codes are the most efficient choices for contemporary SSD devices, and c) previous models overestimate the SSD array reliability by up to six orders of magnitude, as they focus only on the coincidence of bad pages (bit errors) and bad chips within a data stripe, which accounts for a minority of the root causes of data loss in SSD arrays. Our experiments show that the combination of bad chips with bad blocks is the major source of data loss in RAID5 and emerging codes (contributing more than 54% and 90% of data loss in RAID5 and emerging codes, respectively), while RAID6 remains robust under these failure combinations. Finally, the fault injection results reveal that SSD array reliability, as well as the failure breakdown, is significantly correlated with SSD type.

Footnote 2: RBER is defined as the number of corrupted bits over the total number of read bits (including both correctable and uncorrectable errors) [37].
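The RBER definition quoted above can be written out as a short calculation. The following is a minimal sketch only; the function name `rber` and its argument names are illustrative assumptions, not identifiers from the paper.

```python
def rber(corrupted_bits: int, read_bits: int) -> float:
    """Raw bit error rate: corrupted bits divided by the total bits read.

    Per the definition above, corrupted_bits counts both correctable and
    uncorrectable bit errors. Names here are hypothetical.
    """
    if read_bits <= 0:
        raise ValueError("read_bits must be positive")
    if not 0 <= corrupted_bits <= read_bits:
        raise ValueError("corrupted_bits must lie between 0 and read_bits")
    return corrupted_bits / read_bits


if __name__ == "__main__":
    # Example: 3 corrupted bits observed while reading 1000 bits.
    print(rber(3, 1000))
```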

Section: B Analysis and Modeling of SSD Array Reliability (mentioning)

confidence: 99%

A Modeling Framework for Reliability of Erasure Codes in SSD Arrays

Kishani

Ahmadian

2020

IEEE Trans. Comput.

Self Cite

“…Increasing number of I/O intensive applications such as Online Transaction Processing (OLTP), High Performance Computing (HPC), web, and email applications arises the demand in data-centers for high-performance storage systems. The most common approach to improving the performance of storage systems is to employ Solid-State Drives (SSDs) [1] in the caching layer of the disk subsystems [2], [3], [4], [5], [6], which are mainly built upon low-performance and low-reliable Hard Disk Drives (HDD) [7], [8], [9] or mid-range SSDs (as shown in Fig. 1).…”

Section: Introduction (mentioning)

confidence: 99%

LBICA: A Load Balancer for I/O Cache Architectures

Ahmadian

Salkhordeh

2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)

2019

Self Cite

In recent years, enterprise Solid-State Drives (SSDs) have been used in the caching layer of high-performance servers to close the growing performance gap between processing units and the storage subsystem. SSD-based I/O caching is typically not effective in workloads with burst accesses, in which the caching layer itself becomes the performance bottleneck because of the large number of accesses. Existing I/O cache architectures mainly focus on maximizing the cache hit ratio while neglecting the average queue time of accesses. Previous studies suggested bypassing the cache when burst accesses are identified. These schemes, however, are not applicable to a general cache configuration and also result in significant performance degradation on burst accesses. In this paper, we propose a novel I/O cache load balancing scheme (LBICA) with adaptive write policy management to prevent the I/O cache from becoming a performance bottleneck in burst accesses. Our proposal, unlike previous schemes that disable the I/O cache or bypass the requests to the disk subsystem in burst accesses, selectively reduces the number of waiting accesses in the SSD queue and balances the load between the I/O cache and the disk subsystem while providing the maximum performance. The proposed scheme characterizes the workload based on the type of in-queue requests and assigns an effective cache write policy. We aim to bypass the accesses that 1) are served faster by the disk subsystem or 2) cannot be merged with other accesses in the I/O cache queue. As a result, the selected requests are served by the disk layer, which prevents the I/O cache from being overloaded. Our evaluations on a physical system show that LBICA reduces the load on the I/O cache by 48% and improves the performance of burst workloads by 30% compared to the latest state-of-the-art load balancing scheme.
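The two bypass criteria stated in the abstract can be illustrated with a small sketch. This is not LBICA's implementation, which also characterizes the workload and adapts the cache write policy; the `Request` fields, the latency estimates, and the function names below are all hypothetical and exist only to show the selection rule.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    # All fields are illustrative assumptions, not LBICA data structures.
    est_cache_latency_us: float  # estimated service time if queued on the SSD cache
    est_disk_latency_us: float   # estimated service time if sent to the disk subsystem
    mergeable: bool              # can be merged with another request in the cache queue


def should_bypass(req: Request) -> bool:
    """Bypass if the disk would serve it faster, or if it cannot be merged."""
    return req.est_disk_latency_us < req.est_cache_latency_us or not req.mergeable


def split_queue(queue: List[Request]) -> Tuple[List[Request], List[Request]]:
    """Partition a cache queue into (kept in cache, bypassed to disk)."""
    keep = [r for r in queue if not should_bypass(r)]
    bypass = [r for r in queue if should_bypass(r)]
    return keep, bypass
```

Calling `split_queue` on the pending cache queue yields the requests to leave in the SSD queue and those to redirect to the disk layer, preserving their original order within each group.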

“…The availability and reliability of information systems are seriously affected by human errors [1], [2], [3], [4], where some field studies report human errors as the cause of 19% of system failures [5], [3]. Large datacenters with Exa-Byte (EB) storage capacity (by employing millions of disk drives) are expected to face at least one disk failure per hour.…”

Section: Introduction (mentioning)

confidence: 99%

“…To this end, we analyze the possible combinations of operational [Footnote 3: A task that removes LSEs by periodically reading the disk data and checking it with its parity, correcting the corrupted data using the parity and moving it to a new location, and mapping out the damaged sectors.] [Footnote 4: An event in which the whole data of a RAID5 array is lost due to the consecutive failure of two disks.] [Footnote 5: While the incorrect repair service can have many different roots and happen in many different conditions, in this work we focus on IDRS.]…”

Section: Introduction (mentioning)

confidence: 99%

“…• The proposed model is extended to consider the effect of a) LSEs for RAID5 arrays and b) RAID5 with spare disk. • Models in [4] assume a 100% survivable storage system [Footnote 8], while this work assumes the general case in which parts of data can be non-survivable. • For the first time, a novel metric, NOMDU, is proposed to assess the availability of data storage systems.…”

Section: Introduction (mentioning)

confidence: 99%

Modeling Impact of Human Errors on the Data Unavailability and Data Loss of Storage Systems

Kishani

2018

IEEE Trans. Rel.

Self Cite

Data storage systems and their availability play a crucial role in contemporary datacenters. Despite the use of mechanisms such as automatic fail-over in datacenters, the role of human agents and consequently their destructive errors is inevitable. Due to the very large number of disk drives used in exascale datacenters and their high failure rates, the disk subsystem in storage systems has become a major source of Data Unavailability (DU) and Data Loss (DL) initiated by human errors. In this paper, we investigate the effect of Incorrect Disk Replacement Service (IDRS) on the availability and reliability of data storage systems. To this end, we analyze the consequences of IDRS in a disk array and conduct Monte Carlo simulations to evaluate DU and DL during mission time. The proposed modeling framework can cope with a) different storage array configurations and b) Data Object Survivability (DOS), representing the effect of system-level redundancies such as remote backups and mirrors. In the proposed framework, the model parameters are obtained from industrial and scientific reports alongside field data extracted from a datacenter operating with 70 storage racks. The results show that ignoring the impact of IDRS leads to underestimating unavailability by up to three orders of magnitude. Moreover, our study suggests that, once the effect of human errors is considered, the conventional beliefs about the dependability of different Redundant Array of Independent Disks (RAID) mechanisms should be revised. The results show that RAID1 can result in lower availability than RAID5 in the presence of human errors. The results also show that employing an automatic fail-over policy (using hot spare disks) can reduce the drastic impacts of human errors by two orders of magnitude.
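The idea of using Monte Carlo simulation to estimate unavailability caused by wrong disk replacement can be illustrated with a deliberately simplified toy model. This is not the paper's modeling framework: the exponential failure assumption, the single fixed recovery time, and every parameter and function name below are assumptions made for illustration, and data loss, LSEs, spare disks, and DOS are ignored.

```python
import random


def simulate_du_hours(n_disks, disk_mttf_h, replace_h, recovery_h, hep,
                      mission_h, rng):
    """Simulate one mission; return hours of unavailability due to wrong-disk pulls.

    Toy assumptions (not from the paper): disk failures arrive as a Poisson
    process with rate n_disks / disk_mttf_h; each failure triggers one
    replacement; with probability hep (human error probability) the operator
    removes a healthy disk, making data unavailable for recovery_h hours.
    """
    t = 0.0
    du_hours = 0.0
    while True:
        # Time until the next disk failure anywhere in the array.
        t += rng.expovariate(n_disks / disk_mttf_h)
        if t >= mission_h:
            return du_hours
        if rng.random() < hep:
            # Wrong disk pulled: count the outage, truncated at mission end.
            outage = min(recovery_h, mission_h - t)
            du_hours += outage
            t += outage
        # Time to complete the (correct) replacement.
        t += replace_h


def estimate_du_hours(runs=10_000, seed=0, **params):
    """Average unavailability hours per mission over many simulated missions."""
    rng = random.Random(seed)
    total = sum(simulate_du_hours(rng=rng, **params) for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    base = dict(n_disks=5, disk_mttf_h=1_000_000.0, replace_h=1.0,
                recovery_h=4.0, mission_h=87_600.0)
    for hep in (0.0, 0.001, 0.01):
        print(hep, estimate_du_hours(hep=hep, **base))
```

Sweeping `hep` from zero upward shows how, even in this toy setting, the estimated unavailability depends entirely on the human error term once hardware-only causes are excluded from the model.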