2018
DOI: 10.1177/1094342018767736

A scalable and extensible checkpointing scheme for massively parallel simulations

Abstract: Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with the number of system components. For future exascale systems it is therefore considered critical that strategies are developed to make software resilient against failures. In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that …

Cited by 20 publications (20 citation statements)
References 61 publications (127 reference statements)
“…Such data structures could also for example store logistic data as required for handling special boundary conditions or metadata that is needed for load balancing. Similar approaches to decouple the simulation data from the coarse-grained topology have been presented in [7,27] and have been shown to provide an elegant interface for runtime load balancing [28], resilience, and data migration [29]. The macro-primitive graph data structure is therefore not restricted to finite-element based simulation techniques.…”
Section: Simulation Data (mentioning)
confidence: 97%
“…Obersteiner et al (2017) extended a plasma simulation, Laguna et al (2016) a molecular dynamics simulation, and Engelmann and Geist (2003) a Fast Fourier Transformation that gracefully handle hardware faults. Kohl et al (2017) implemented a checkpoint-recovery system for a material science simulation. After a failure, the system initially assigns the work of the failed PEs to a single PE.…”
Section: Related Work (mentioning)
confidence: 99%
“…Researchers have already used ULFM for other scientific software (Ali et al, 2016; Engelmann and Geist, 2003; Kohl et al, 2017; Laguna et al, 2016; Obersteiner et al, 2017). ULFM reports failures by returning an error on at least one rank which participated in the failed communication.…”
Section: The New MPI Standard and User Level (mentioning)
confidence: 99%
“…In addition, when using checkpointing techniques, the additional time for each checkpoint delays the termination of the program and automatically yields a higher power consumption. Moreover, the memory necessary for checkpointing needs to be provided (Kohl et al, 2017) and must be kept in a reliable state, which may further increase energy consumption. Besides hard failures, which cause a physical loss of a computing entity, an increase of failures which are not immediately noticeable, is expected.…”
Section: Introduction (mentioning)
confidence: 99%
“…For the reconstruction of the static data, we refer to the in-memory checkpointing approaches and their performance analysis presented (e.g. Kohl et al, 2019).…”
Section: Introduction (mentioning)
confidence: 99%