Cosmic rays don't strike twice

Hwang, Andy; Stefanovici, Ioan; Schroeder, Bianca

doi:10.1145/2150976.2150989

Cited by 146 publications

(7 citation statements)

References 18 publications


“…According to early work at IBM (Ziegler and Lanford, 1979), errors affecting memory devices can be divided into two basic groups: hard errors are those caused by a physical defect, while soft errors are transient in nature and caused by some kind of electromagnetic interaction, such as a cosmic ray strike. Considerable work has been carried out on understanding the causes and effects of cosmic rays on silicon devices (Ziegler and Lanford, 1979, 1981; Ziegler, 1996; Ziegler et al, 1996), in particular on their effect on DRAM devices (McKee and McAdams, 1996; Borucki et al, 2008; Fang et al, 2009; Hwang et al, 2012).…”

Section: Introduction

mentioning

confidence: 99%

Application-based fault tolerance techniques for sparse matrix solvers

McIntosh-Smith, Hunt, Price, et al. 2017

The International Journal of High Performance Computing Applica


High-performance computing (HPC) systems continue to increase in size in the quest for ever higher performance. The resulting increased electronic component count, coupled with the decrease in feature sizes of the silicon manufacturing processes used to build these components, may result in future Exascale systems being more susceptible to soft errors caused by cosmic radiation than current HPC systems. Through the use of techniques such as hardware-based error-correcting codes (ECC) and checkpoint-restart, many of these faults can be mitigated, but at the cost of increased hardware overhead, run-time, and energy consumption that can be as much as 10-20%. Some predictions expect these overheads to continue to grow over time. For extreme scale systems, these overheads will represent megawatts of power consumption and millions of dollars of additional hardware cost, which could potentially be avoided with more sophisticated fault-tolerance techniques. In this paper we present new software-based fault tolerance techniques that can be applied to one of the most important classes of software in HPC: iterative sparse matrix solvers. Our new techniques enable us to exploit knowledge of the structure of sparse matrices in such a way as to improve the performance, energy efficiency and fault tolerance of the overall solution.


“…If every region of memory is equally likely to experience an uncorrectable error, we would expect to see relatively few errors in kernel memory because it typically occupies a much smaller memory footprint than the application. However, recent evidence suggests that kernel memory may be more prone to memory errors than other regions of memory (Hwang et al, 2012).…”

Section: Introduction

mentioning

confidence: 99%

“… 1. This might happen if, for example, the MCE was raised by a memory scrubber. However, it is not clear that this is a common scenario (Hwang et al, 2012). …”

mentioning

confidence: 99%

A study of the viability of exploiting memory content similarity to improve resilience to memory errors

Levy, Ferreira, Bridges, et al. 2014

The International Journal of High Performance Computing Applica


Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory. In this paper, we propose a novel runtime for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the viability of this approach by examining memory snapshots collected from eight high-performance computing (HPC) applications and two important HPC operating systems. Based on the characteristics of the similarity uncovered, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
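As a rough illustration of what "memory content similarity" could mean in practice, the sketch below measures the fraction of pages in a snapshot that have at least one byte-identical copy elsewhere in the same snapshot. This is not the runtime proposed in the paper: the function name `duplicate_page_fraction`, the 4 KiB page size, and the use of SHA-256 to group pages are all assumptions made for the example.

```python
import hashlib
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes; not taken from the paper


def duplicate_page_fraction(snapshot: bytes, page_size: int = PAGE_SIZE) -> float:
    """Return the fraction of pages that have an identical copy in the snapshot.

    Hypothetical helper: pages are grouped by the SHA-256 digest of their
    content, and every page in a group of size two or more counts as duplicated.
    """
    pages = [snapshot[i:i + page_size] for i in range(0, len(snapshot), page_size)]
    if not pages:
        return 0.0
    counts = Counter(hashlib.sha256(page).digest() for page in pages)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(pages)


if __name__ == "__main__":
    # Two identical zero pages plus two distinct pages: half the pages have a copy.
    demo = b"\x00" * (2 * PAGE_SIZE) + b"\x01" * PAGE_SIZE + b"\x02" * PAGE_SIZE
    print(duplicate_page_fraction(demo))
```

A page with an identical copy elsewhere is, in principle, one whose content could be restored from that copy after an uncorrectable error, which is why a duplicate fraction is a plausible first measurement.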


“…as well as software (operating system, runtime, unscheduled maintenance interruption). In fact, recent work indicates that (i) servers tend to crash twice a year (2-4% failure rate) [1], (ii) 1-5% of disk drives die per year [2], (iii) DRAM errors occur in 2% of all DIMMs per year [1], which is more frequent than commonly believed, and (iv) large scale studies indicate that simple ECC mechanisms alone are not capable of correcting a significant number of DRAM errors [3]. Even for small systems, such causes result in fairly low mean-time-between-failures/interrupts (MTBF/I) as depicted in Figure I [4], and the 6.9 hours estimated by Livermore National Lab for its BlueGene confirms this.…”

Section: Introduction

mentioning

confidence: 99%

Detection and correction of silent data corruption for large-scale high-performance computing

Fiala, Mueller, Engelmann, et al. 2012

2012 International Conference for High Performance Computing, Networking, Storage and Analysis


Faults have become the norm rather than the exception for high-end computing on clusters with tens or hundreds of thousands of cores. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that corrupt memory while applications continue to operate and report incorrect results. This paper studies the potential for redundancy to both detect and correct soft errors in MPI message-passing applications. Our study investigates the challenges inherent to detecting soft errors within MPI applications while providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI message data between replicas, we study the best suited protocols for detecting and correcting MPI data that is the result of corruption. To experimentally validate our proposed detection and correction protocols, we introduce RedMPI, an MPI library which resides in the MPI profiling layer. RedMPI is capable of both online detection and correction of soft errors that occur in MPI applications without requiring any modifications to the application source by utilizing either double or triple redundancy. Our results indicate that our most efficient consistency protocol can successfully protect applications experiencing even high rates of silent data corruption with runtime overheads between 0% and 30% as compared to unprotected applications without redundancy. Using our fault injector within RedMPI, we observe that even a single soft error can have profound effects on running applications, causing a cascading pattern of corruption that in most cases spreads to all other processes. RedMPI's protection has been shown to successfully mitigate the effects of soft errors while allowing applications to complete with correct results even in the face of errors.
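The distinction the abstract draws between double and triple redundancy can be sketched as a comparison of the message payloads produced by replicas of the same sender. The code below is only an illustration of that idea under stated assumptions; it is not RedMPI's protocol or API, and the function name `vote` and the status strings are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def vote(replica_payloads: Sequence[bytes]) -> Tuple[Optional[bytes], str]:
    """Compare the payloads sent by replicas of one logical sender.

    Hypothetical helper, not part of any MPI library. Returns the payload
    to deliver and a status:
      "agree"     - all replicas sent identical data
      "corrected" - a strict majority agrees, so the minority is outvoted
      "detected"  - replicas disagree and no strict majority exists
    """
    counts = Counter(replica_payloads)
    payload, votes = counts.most_common(1)[0]
    if votes == len(replica_payloads):
        return payload, "agree"
    if votes > len(replica_payloads) // 2:
        return payload, "corrected"
    return None, "detected"


if __name__ == "__main__":
    # Triple redundancy: one corrupted replica is outvoted.
    print(vote([b"abc", b"abX", b"abc"]))
    # Double redundancy: a mismatch can be detected but not corrected.
    print(vote([b"abc", b"abX"]))
```

With two replicas a mismatch only reveals that something went wrong, whereas with three replicas a single corrupted copy can be outvoted, which matches the detect-versus-correct distinction in the abstract.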
