Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2012
DOI: 10.1145/2145816.2145845

Algorithm-based fault tolerance for dense matrix factorizations

Abstract: Dense matrix factorizations, such as LU, Cholesky, and QR, are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a rapid decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider …
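The abstract does not spell out the encoding, but the core ABFT idea it builds on is to append checksum columns to the matrix before factorization so that data lost to a fail-stop failure can be rebuilt from the survivors. A minimal numpy sketch of that checksum-recovery principle, not the paper's actual protocol, assuming a single lost column and a plain ones-vector checksum:

```python
import numpy as np

# Minimal sketch of the ABFT checksum idea (not the paper's exact protocol):
# append a checksum column A @ e to the matrix; if one column of the encoded
# matrix is lost to a fail-stop failure, it can be rebuilt from the others.

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))

e = np.ones((n, 1))
Ac = np.hstack([A, A @ e])           # encoded matrix [A | A*e]

lost = 2                             # pretend column 2 vanished with a process
damaged = Ac.copy()
damaged[:, lost] = np.nan

# Each row of [A | A*e] sums (over the data part) to its checksum entry,
# so the missing column = checksum - sum of surviving data columns.
survivors = [j for j in range(n) if j != lost]
recovered = damaged[:, n] - damaged[:, survivors].sum(axis=1)

assert np.allclose(recovered, A[:, lost])
```

The point of ABFT is that a suitably encoded factorization keeps this checksum relationship valid as the computation proceeds, so recovery needs no rollback.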

Cited by 84 publications (66 citation statements) · References 13 publications
“…ABFT was first introduced to deal with silent errors in systolic arrays [7]. In recent work, the technique has been employed to recover from process failures [17,10,9] in dense and sparse linear algebra factorizations [11,12,13], but the idea extends widely to numerous algorithms employed in crucial HPC applications. So-called Naturally Fault-Tolerant algorithms simply obtain the correct result despite the loss of portions of the dataset (master-slave programs are typical of this class, as are iterative methods such as GMRES or CG [8,18]).…”
Section: Related Work
confidence: 99%
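The statement attributes the original silent-error use of ABFT to [7] (the systolic-array checksum scheme). A hypothetical illustration of that concept, with all values invented: encode row checksums into one operand and column checksums into the other, so a corrupted entry of the product is located where a failing row check and a failing column check intersect.

```python
import numpy as np

# Illustration of the full-checksum matrix-multiply idea cited as [7]:
# a silent error in C = A @ B is detected and located by checking that the
# computed product's row/column sums still match its checksum row/column.

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

e = np.ones((n, 1))
Ar = np.vstack([A, e.T @ A])         # A with an extra checksum row
Bc = np.hstack([B, B @ e])           # B with an extra checksum column

C = Ar @ Bc                          # fully encoded product
C[1, 3] += 1e-3                      # inject a silent corruption

data = C[:n, :n]
row_ok = np.isclose(C[:n, n], data.sum(axis=1))   # checksum column check
col_ok = np.isclose(C[n, :n], data.sum(axis=0))   # checksum row check
print("corrupted entry at",
      (int(np.where(~row_ok)[0][0]), int(np.where(~col_ok)[0][0])))
```

Running this prints `corrupted entry at (1, 3)`, matching the injected error.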
“…A significant new contribution is to propose a generalized model for a protocol that alternates between checkpointing and ABFT sections. Although most ABFT methods have a complete complexity analysis (in terms of extra flops and of the communication incurred by both the protection activity and each recovery [10,9]), modeling the runtime overhead of ABFT methods under failure conditions has never been proposed. The composite model captures the behavior of both the checkpointing and ABFT phases, as well as the cost of switching between the two approaches, and thereby permits investigating the prospective gain from employing this mixed recovery strategy on extreme-scale platforms.…”
Section: Related Work
confidence: 99%
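The composite model itself is not reproduced in the statement. As a rough, hypothetical illustration of the tradeoff such a model arbitrates, here is a back-of-the-envelope comparison using the classical Young/Daly checkpoint period; every parameter below is invented for illustration and is not taken from the cited work.

```python
import math

# Hypothetical comparison of the two regimes a composite checkpoint/ABFT
# model would arbitrate between. All numbers are invented for illustration.

def checkpoint_overhead(C, mtbf):
    """Relative overhead of periodic checkpointing with the Young/Daly
    period T = sqrt(2*C*MTBF): checkpoint cost plus expected rework."""
    T = math.sqrt(2 * C * mtbf)
    return C / T + T / (2 * mtbf)

def abft_overhead(extra_flops_ratio):
    """ABFT pays a roughly constant fraction of extra flops for checksum
    updates, largely independent of the failure rate."""
    return extra_flops_ratio

for nodes in (1_000, 10_000, 100_000):
    mtbf = 5 * 3600 * 100_000 / nodes  # system MTBF shrinks with scale (s)
    ckpt = checkpoint_overhead(C=600, mtbf=mtbf)  # 10-minute checkpoints
    abft = abft_overhead(0.03)                    # ~3% extra flops
    print(f"{nodes:>7} nodes: checkpoint {ckpt:5.1%} vs ABFT {abft:5.1%}")
```

Under these made-up parameters, checkpointing overhead grows from a few percent to roughly a quarter of the runtime as the machine scales, while the ABFT overhead stays flat, which is the regime where a protocol that switches between the two pays off.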