ESoftCheck: Removal of Non-vital Checks for Fault Tolerance

Yu, Jing; Garzarán, María Jesús; Snir, Marc

doi:10.1109/cgo.2009.14

Cited by 27 publications

(10 citation statements)

References 35 publications


“…Similarly, transactional semantics [24] can be used for MPI, but not on application variables that may be corrupted. Esoftcheck [46], uses compiler analysis to remove redundant SDC detectors to maintain high reliability, but does not consider the latency of detection and how it effects propagation. An analytic version of this problem which investigates optimal placement of detectors of different capabilities to verify a checkpoint is corruption free is presented in [4], but considers a fixed recovery time that does not change based on how much state is corrupted.…”

Section: Related Work (mentioning)

confidence: 99%

Towards a More Complete Understanding of SDC Propagation

Calhoun

Snir

Olson

et al. 2017

Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing

Self Cite


With the rate of errors that can silently affect an application's state/output expected to increase on future HPC machines, numerous application-level detection and recovery schemes have been proposed. Recovery is more efficient when errors are contained and affect only part of the computation's state. Containment is usually achieved by verifying all information leaking out of a statically defined containment domain, which is an expensive procedure. Alternatively, error propagation can be analyzed to bound the domain that is affected by a detected error. This paper investigates how silent data corruption (SDC) due to soft errors propagates through three HPC applications: HPCCG, Jacobi, and CoMD. To allow for a more detailed view of error propagation, the paper tracks propagation at the instruction and application variable level. The impact of detection latency on error propagation is shown along with an application's ability to recover. Finally, the impact of compiler optimizations is explored along with the impact of local problem size on error propagation.


“…Furthermore, redundancy can be used in order to reduce the probability of ‘bad errors’. Critical computations can be executed twice (and the redundancy can be introduced automatically by a compiler; see Reis et al, 2005b; Yu et al, 2009); more reliable memory may be used for more sensitive data, and so forth.…”

Section: Possible Scenarios (mentioning)

confidence: 99%

Addressing failures in exascale computing

Snir

Wisniewski

Abraham

et al. 2014

The International Journal of High Performance Computing Applications

Self Cite



We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.


“…Soft errors can be handled by dual modular redundancy (DMR). DMR approaches, typically assisted by compilers, duplicate computing instructions and insert check instructions into the original programs [14,40,41,48,68]. DMR is very general and can be applied to any application, but it introduces high overhead especially for computing-bound applications because it duplicates all computations.…”

Section: Introduction (mentioning)

confidence: 99%

FT-BLAS

Zhai

Giem

Fan

et al. 2021

Proceedings of the ACM International Conference on Supercomputing


Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides comparable performance to modern state-of-the-art BLAS libraries on widely-used processors such as Intel Skylake and Cascade Lake. To accommodate the features of BLAS, which contains both memory-bound and computing-bound routines, we propose a hybrid strategy to incorporate fault tolerance into our brand-new BLAS implementation: duplicating computing instructions for memory-bound Level-1 and Level-2 BLAS routines and incorporating an Algorithm-Based Fault Tolerance mechanism for computing-bound Level-3 BLAS routines. Our high performance and low overhead are obtained from delicate assembly-level optimization and a kernel-fusion approach to the computing kernels. Experimental results demonstrate that FT-BLAS offers high reliability and high performance, faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14% and 21.70%, respectively, for routines spanning all three levels of BLAS we benchmarked, even under hundreds of errors injected per minute.

