2013
DOI: 10.1002/cpe.3100

Extending the scope of the Checkpoint‐on‐Failure protocol for forward recovery in standard MPI

Abstract: Most predictions of Exascale machines picture billion-way parallelism, encompassing not only millions of cores, but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: 1) traditional checkpoint based approach…

Cited by 17 publications (10 citation statements)
References 21 publications
“…Few papers [13][14][15][16][17][18] have been published discussing the need and techniques for checkpointing web services.…”
Section: Checkpointing In Web Services: Literature Survey (mentioning)
confidence: 99%
“…Yao and Wang [52] proposed a nonstop algorithm based fault tolerant scheme to recover the solution vector from fail-stop process failures in HPL 2.0. Bland et al. [7,8] proposed a Checkpoint-on-Failure protocol for fault recovery in dense linear algebra.…”
Section: Related Work (mentioning)
confidence: 99%
“…The application itself may be able to handle the error and terminate cleanly [5] or perform some sort of recovery procedure relying on Algorithmic-Based Fault Tolerance (ABFT), which has been extensively applied to MPI programs [10], [17], [29], as well as shared memory programming models [38], [40]. Algorithmic approaches have demonstrated to be more efficient than backward recoveries like checkpointing-rollback.…”
Section: Introduction (mentioning)
confidence: 99%