A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units

Braun, Claus; Halder, Sebastian; Wunderlich, Hans Joachim

doi:10.1109/dsn.2014.48

Cited by 33 publications

(15 citation statements)

References 26 publications


“…Braun et al [4] used a simple fault injector (for matrix multiplication) that was able to inject faults into a target SM and one of its functional units. Ours can also target a specific SM and SP, but has the added capability of allowing for variable fault start time and active duration.…”

Section: Related Work (mentioning)

confidence: 99%

Transient Fault Resilient QR Factorization on GPUs

Loh, Ramanathan, Saluja

2015

Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale


With their inherent capability to exploit parallelism, GPUs have become a popular platform for data-intensive scientific computing applications. This trend is expected to continue as the number of computations required by scientific applications reaches the petascale and even exascale range. As the minimum feature size of transistors decreases due to improving process technology, GPUs are becoming more vulnerable to transient faults caused by events such as power fluctuations and alpha particle strikes; we therefore need methods that guarantee correct computation even in the presence of such faults. In this paper, we develop and analyze three fault tolerant schemes, FC-O, PC-C and PC-CS, for the block Householder QR algorithm that can deal with faults in the streaming processor (SP) core of a GPU. We also present a transient fault injection mechanism for NVIDIA GPUs, which has the capability of injecting faults of varying durations. We show that two of our schemes, PC-C and PC-CS, have good error coverage and relatively low overhead, and can scale reasonably well at the petascale and exascale range.
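The injectors described in this entry target a specific SM and SP on a GPU, and their interfaces are not given here. As a purely illustrative, CPU-side sketch of the same idea (a transient fault with a start time and an active duration), the Python below flips one bit of every product computed during a window of steps in a dot product. The names `flip_bit` and `faulty_dot`, and the use of loop steps as a stand-in for time, are assumptions made for this example and are not taken from the cited work.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of the IEEE-754 binary64 encoding of `value`.

    Bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def faulty_dot(x, y, start: int, duration: int, bit: int) -> float:
    """Dot product in which a hypothetical multiplier is faulty for a while.

    Products computed at steps start <= step < start + duration have one
    bit flipped; all other steps are fault-free.
    """
    acc = 0.0
    for step, (a, b) in enumerate(zip(x, y)):
        product = float(a) * float(b)
        if start <= step < start + duration:
            product = flip_bit(product, bit)  # transient fault is active
        acc += product
    return acc
```

A `duration` of 0 models a fault-free run, which gives a reference result against which a detection scheme can be checked.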


“…Similar to other works [15], [16], our approach uses the reciprocal distribution of mantissa bits to calculate the rounding error bound for a block checksum. The principle of the method is based on the fact that matrix multiplication consists of multiple steps of multiplication and addition, and the rounding error bound can be obtained by calculating expectation and variance during those steps.…”

Section: B Block Size and Rounding Error (mentioning)

confidence: 99%

“…A simplified error analysis (SEA) approach for ABFT is introduced in [23]. A-ABFT calculates the range of rounding errors through the probability distribution of floating-point tails [15].…”

Section: Related Work (mentioning)

confidence: 99%

FT-PBLAS: PBLAS-Based Fault-Tolerant Linear Algebra Computation on High-performance Computing Systems

2020


As high-performance computing (HPC) systems have scaled up, resilience has become a great challenge. To guarantee resilience, various kinds of hardware and software techniques have been proposed. However, among popular software fault-tolerant techniques, both the checkpoint-restart approach and the replication technique face challenges of scalability in the era of peta- and exa-scale systems due to their numerous processes. In this situation, algorithm-based approaches, or algorithm-based fault tolerance (ABFT) mechanisms, have become attractive because they are efficient and lightweight. Although the ABFT technique is algorithm-dependent, it is possible to implement it at a low level (e.g., in libraries for basic numerical algorithms) and make it application-independent. However, previous ABFT approaches have mainly aimed at achieving fault tolerance in integrated circuits (ICs) or at the architecture level and are therefore not suitable for HPC systems; e.g., they use checksums of rows and columns of matrices rather than checksums of blocks to detect errors. Furthermore, they cannot deal with errors caused by node failure, which are common in current HPC systems. To solve these problems, this paper proposes FT-PBLAS, a PBLAS-based library for fault-tolerant parallel linear algebra computations that can be regarded as a fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS), because it provides a series of fault-tolerant versions of interfaces in PBLAS. To support the underlying error detection and recovery mechanisms in the library, we propose a block-checksum approach for non-fatal errors and a scheme for addressing node failure, respectively. We evaluate two fault-tolerant mechanisms and FT-PBLAS on HPC systems, and the experimental results demonstrate the performance of our library.

Index Terms: Algorithm-based fault tolerance, HPC systems, node failure, matrix multiplication, linear algebra computations.
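The citation statements above concern how large a checksum mismatch may be before it counts as an error rather than ordinary rounding. The following Python sketch shows where such a tolerance enters a column-checksum test for one block C = A·B. It deliberately swaps in the standard worst-case bound from floating-point error analysis (the factor γ_k = k·u / (1 − k·u), with u the unit roundoff) in place of the probabilistic expectation-and-variance bound that the quoted works describe. The function names and the factor of 2 (one rounding budget for each of the two checksum paths) are assumptions of this example.

```python
import sys

U = sys.float_info.epsilon / 2  # unit roundoff of IEEE-754 binary64


def matmul(a, b):
    """Plain triple-loop product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_checksum_ok(a, b, c):
    """Check that the column sums of c match (e^T a) b within a rounding bound.

    The tolerance is a conservative worst-case bound, not the probabilistic
    bound used by A-ABFT or FT-PBLAS.
    """
    m, n, p = len(a), len(b), len(b[0])
    gamma = (m + n) * U / (1 - (m + n) * U)
    col_sum_a = [sum(a[i][k] for i in range(m)) for k in range(n)]
    abs_sum_a = [sum(abs(a[i][k]) for i in range(m)) for k in range(n)]
    for j in range(p):
        expected = sum(col_sum_a[k] * b[k][j] for k in range(n))
        observed = sum(c[i][j] for i in range(m))
        tol = 2 * gamma * sum(abs_sum_a[k] * abs(b[k][j]) for k in range(n))
        if abs(expected - observed) > tol:
            return False  # mismatch exceeds what rounding alone can explain
    return True
```

A worst-case bound of this kind never flags a correct result, but it grows with the block dimensions and can hide small errors; that trade-off is the usual motivation for tighter, probabilistic bounds.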


“…It imposes a low overhead on the application and guarantees a good SDC detection recall in general. Some recent studies [8] have shown that ABFT can be implemented for matrix multiplications on hardware accelerators. In addition to its detection capability, ABFT offers correction features.…”

Section: Algorithm-based Fault Tolerance (mentioning)

confidence: 99%

Detecting Silent Data Corruption for Extreme-Scale MPI Applications

Bautista-Gomez, Cappello

2015

Proceedings of the 22nd European MPI Users' Group Meeting


Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption based solely on the behavior of the application datasets and is application-agnostic. We propose multiple corruption detectors, and we couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect over 80% of corruptions, while incurring less than 1% of overhead. We show that the false positive rate is less than 1% and that when multi-bit corruptions are taken into account, the detection recall increases to over 95%.
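The citation statement in this entry notes that ABFT offers correction as well as detection for matrix multiplication. A minimal sketch of the classic row-and-column checksum idea is given below, assuming integer matrices so that all checksums are exact and no rounding tolerance is needed. It is not the A-ABFT implementation, and the function name `locate_and_correct` is hypothetical. A single wrong element of C = A·B shows up as exactly one mismatching row checksum and one mismatching column checksum, which together give its position and the amount by which it is off.

```python
def locate_and_correct(a, b, c):
    """Find and repair (in place) at most one wrong element of c = a * b.

    Returns the (row, column) of the repaired element, or None if every
    checksum matches. Assumes integer entries, so comparisons are exact.
    """
    m, n, p = len(a), len(b), len(b[0])
    col_sum_a = [sum(a[i][k] for i in range(m)) for k in range(n)]  # e^T A
    row_sum_b = [sum(b[k][j] for j in range(p)) for k in range(n)]  # B e
    # Expected column sums and row sums of the product, computed
    # independently of c itself.
    col_check = [sum(col_sum_a[k] * b[k][j] for k in range(n)) for j in range(p)]
    row_check = [sum(a[i][k] * row_sum_b[k] for k in range(n)) for i in range(m)]

    bad_cols = [j for j in range(p)
                if sum(c[i][j] for i in range(m)) != col_check[j]]
    bad_rows = [i for i in range(m) if sum(c[i]) != row_check[i]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("checksums do not point to a single element")

    i, j = bad_rows[0], bad_cols[0]
    c[i][j] += row_check[i] - sum(c[i])  # restore the expected row sum
    return (i, j)
```

With floating-point data the equality tests would have to be replaced by comparisons against a rounding-error bound, which is the problem the bound-determination work cited on this page addresses.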
