2015
DOI: 10.1109/tpds.2014.2342228

Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers

Abstract: Supercomputers have seen an exponential increase in their size in the last two decades. Such a high growth rate is expected to take us to exascale in the timeframe 2018-2022. But, to bring a productive exascale environment about, it is necessary to focus on several key challenges. One of those challenges is fault tolerance. Machines at extreme scale will experience frequent failures and will require the system to avoid or overcome those failures. Various techniques have recently been developed to tole…

Cited by 26 publications (25 citation statements)
References 35 publications
“…The checkpoint buddy of the failed node will provide a checkpoint to the replacement node. This approach has been shown to be scalable and it can recover from most failures in real HPC systems [10]. Other variants of checkpoint/restart create a multilevel framework where checkpoints can go to local or shared storage [7].…”
Section: B Checkpoint/restart
Citation type: mentioning
confidence: 99%
“…An enhancement to checkpoint/restart is message logging [15], a technique that stores checkpoints and, in principle, stores all the messages in an execution. There are multiple implementations of message logging for HPC systems [10], [16], [17]. The benefit of storing communication is that a failure only requires the failed node to rollback, hence only local rollback is needed.…”
Section: Message Logging
Citation type: mentioning
confidence: 99%
“…In contrast, the Cray XK6/XK7 (Titan) at Oak Ridge National Laboratory (10-20/27 petaflops) achieves a MTBI of 132/173 h [2]. The anticipated failure rate of an exascale machine is likely to be higher than present systems [8,9,23,28] and therefore application resilience is critical in maintaining the usefulness of any future exascale system.…”
Section: Introduction
Citation type: mentioning
confidence: 99%