2017
DOI: 10.1007/978-3-319-64203-1_36

GASPI/GPI In-memory Checkpointing Library

Cited by 4 publications (4 citation statements)
References 13 publications
“…Through these features, applications can flexibly adjust to function with the remaining healthy processes. Moreover, a failed process can be replaced by implementing a suitable checkpointing scheme [64].…”
Section: A Programming Models (mentioning)
confidence: 99%
“…As the number of nodes per parallel program execution continues to grow, the congestion on the PFS increases, resulting in a bottleneck and reduced checkpointing performance [15,16]. Examples for in-memory checkpointing libraries include LFLR [31], SCR [24], ftRMA [7], Fenix [14], and the algorithms described by Lu [21] and Bartsch et al. [5]. All of these employ the substitute strategy and therefore rely on the availability of replacement nodes.…”
Section: Related Work (mentioning)
confidence: 99%
“…Checkpointing libraries usually write their checkpoints to a parallel file system (PFS) [2,6,28,25], implying slow recovery due to low disk access speeds and because many processors simultaneously access the same resources. Many checkpointing libraries also assume the nature of the failures to be minor such that the process can simply be started again, or they assume that enough spare resources are kept idle to start a new process for replacing the failed one [2,6,28,25,31,24,7,14,21,5]. Under this assumption, a re-spawned process can simply read exactly the data of the failed process.…”
Section: Introduction (mentioning)
confidence: 99%
“…In order to have an asynchronous fault-tolerant application, we have used the GPI In-memory checkpoint library [4], in order to use a checkpoint/restart based methodology, saving the state of the execution at certain points of it, to be able to recover that state in case of failure. The application needs to decide when it is more reasonable to perform a checkpoint.…”
Section: Mixed MPI/GPI-2 (mentioning)
confidence: 99%