Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2012
DOI: 10.1145/2145816.2145845

Algorithm-based fault tolerance for dense matrix factorizations

Abstract: Dense matrix factorizations, such as LU, Cholesky, and QR, are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a rapid decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider …
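The abstract does not spell out the encoding, but the core ABFT idea it builds on is to append checksum columns to the matrix before factorization so that data lost to a fail-stop failure can be rebuilt from the survivors. A minimal numpy sketch of that checksum-recovery principle, not the paper's actual protocol, assuming a single lost column and a plain ones-vector checksum:

```python
import numpy as np

# Minimal sketch of the ABFT checksum idea (not the paper's exact protocol):
# append a checksum column A @ e to the matrix; if one column of the encoded
# matrix is lost to a fail-stop failure, it can be rebuilt from the others.

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))

e = np.ones((n, 1))
Ac = np.hstack([A, A @ e])           # encoded matrix [A | A*e]

lost = 2                             # pretend column 2 vanished with a process
damaged = Ac.copy()
damaged[:, lost] = np.nan

# Each row of [A | A*e] sums (over the data part) to its checksum entry,
# so the missing column = checksum - sum of surviving data columns.
survivors = [j for j in range(n) if j != lost]
recovered = damaged[:, n] - damaged[:, survivors].sum(axis=1)

assert np.allclose(recovered, A[:, lost])
```

The point of ABFT is that a suitably encoded factorization keeps this checksum relationship valid as the computation proceeds, so recovery needs no rollback.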

Cited by 84 publications (66 citation statements) · References 13 publications
“…ABFT was first introduced to deal with silent errors in systolic arrays [7]. In recent work, the technique has been employed to recover from process failures [17,10,9] in dense and sparse linear algebra factorizations [11,12,13], but the idea extends widely to numerous algorithms employed in crucial HPC applications. So-called Naturally Fault-Tolerant algorithms simply obtain the correct result despite the loss of portions of the dataset (master-slave programs are typical of this class, as are iterative methods such as GMRES or CG [8,18]).…”
Section: Related Work
confidence: 99%
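The statement attributes the original silent-error use of ABFT to [7] (the systolic-array checksum scheme). A hypothetical illustration of that concept, with all values invented: encode row checksums into one operand and column checksums into the other, so a corrupted entry of the product is located where a failing row check and a failing column check intersect.

```python
import numpy as np

# Illustration of the full-checksum matrix-multiply idea cited as [7]:
# a silent error in C = A @ B is detected and located by checking that the
# computed product's row/column sums still match its checksum row/column.

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

e = np.ones((n, 1))
Ar = np.vstack([A, e.T @ A])         # A with an extra checksum row
Bc = np.hstack([B, B @ e])           # B with an extra checksum column

C = Ar @ Bc                          # fully encoded product
C[1, 3] += 1e-3                      # inject a silent corruption

data = C[:n, :n]
row_ok = np.isclose(C[:n, n], data.sum(axis=1))   # checksum column check
col_ok = np.isclose(C[n, :n], data.sum(axis=0))   # checksum row check
print("corrupted entry at",
      (int(np.where(~row_ok)[0][0]), int(np.where(~col_ok)[0][0])))
```

Running this prints `corrupted entry at (1, 3)`, matching the injected error.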
“…A significant new contribution is to propose a generalized model for a protocol that alternates between checkpointing and ABFT sections. Although most ABFT methods have a complete complexity analysis (in terms of extra flops and of the communication incurred by both the protection activity and each recovery [10,9]), modeling the runtime overhead of ABFT methods under failure conditions has never been proposed. The composite model captures the behavior of both the checkpointing and ABFT phases, as well as the cost of switching between the two approaches, and thereby permits investigating the prospective gain from employing this mixed recovery strategy on extreme-scale platforms.…”
Section: Related Work
confidence: 99%
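The composite model itself is not reproduced in the statement. As a rough, hypothetical illustration of the tradeoff such a model arbitrates, here is a back-of-the-envelope comparison using the classical Young/Daly checkpoint period; every parameter below is invented for illustration and is not taken from the cited work.

```python
import math

# Hypothetical comparison of the two regimes a composite checkpoint/ABFT
# model would arbitrate between. All numbers are invented for illustration.

def checkpoint_overhead(C, mtbf):
    """Relative overhead of periodic checkpointing with the Young/Daly
    period T = sqrt(2*C*MTBF): checkpoint cost plus expected rework."""
    T = math.sqrt(2 * C * mtbf)
    return C / T + T / (2 * mtbf)

def abft_overhead(extra_flops_ratio):
    """ABFT pays a roughly constant fraction of extra flops for checksum
    updates, largely independent of the failure rate."""
    return extra_flops_ratio

for nodes in (1_000, 10_000, 100_000):
    mtbf = 5 * 3600 * 100_000 / nodes  # system MTBF shrinks with scale (s)
    ckpt = checkpoint_overhead(C=600, mtbf=mtbf)  # 10-minute checkpoints
    abft = abft_overhead(0.03)                    # ~3% extra flops
    print(f"{nodes:>7} nodes: checkpoint {ckpt:5.1%} vs ABFT {abft:5.1%}")
```

Under these made-up parameters, checkpointing overhead grows from a few percent to roughly a quarter of the runtime as the machine scales, while the ABFT overhead stays flat, which is the regime where a protocol that switches between the two pays off.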