GASPI/GPI In-memory Checkpointing Library

Bartsch, V.; Machado, Rui; Merten, D.; Rahn, Mirko; Pfreundt, Franz-Josef

doi:10.1007/978-3-319-64203-1_36

Cited by 4 publications

(4 citation statements)

References 13 publications

“…Through these features, applications can flexibly adjust to function with the remaining healthy processes. Moreover, a failed process can be replaced by implementing a suitable checkpointing scheme [64].…”

Section: A Programming Models (mentioning)

confidence: 99%

Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities

Tarraf, Schreiber, Cascajo et al. 2024

IEEE Trans. Parallel Distrib. Syst.

With the increase of complex scientific simulations driven by workflows and heterogeneous workload profiles, managing system resources effectively is essential for improving performance and system throughput, especially due to trends like heterogeneous HPC and deeply integrated systems with on-chip accelerators. For optimal resource utilization, dynamic resource allocation can improve productivity across all system and application levels, by adapting the applications' configurations to the system's resources. In this context, malleable jobs, which can change resources at runtime, can increase the system throughput and resource utilization while bringing various advantages for HPC users (e.g., shorter waiting time). Malleability has received much attention recently, even though it has been an active research area for almost two decades [1]. This paper presents the state-of-the-art of malleable implementations in HPC systems, targeting mainly malleability in compute and I/O resources. Based on our experiences, we state our current concerns and list future opportunities for research.

“…As the number of nodes per parallel program execution continues to grow, the congestion on the PFS increases - resulting in a bottleneck and reduced checkpointing performance [15,16]. Examples for in-memory checkpointing libraries include LFLR [31], SCR [24], ftRMA [7], Fenix [14], and the algorithms described by Lu [21] and Bartsch et al. [5]. All of these employ the substitute strategy and therefore rely on the availability of replacement nodes.…”

Section: Related Work (mentioning)

confidence: 99%

“…Checkpointing libraries usually write their checkpoints to a parallel file system (PFS) [2,6,28,25], implying slow recovery due to low disk access speeds and because many processors simultaneously access the same resources. Many checkpointing libraries also assume the nature of the failures to be minor such that the process can simply be started again, or they assume that enough spare resources are kept idle to start a new process for replacing the failed one [2,6,28,25,31,24,7,14,21,5]. Under this assumption, a re-spawned process can simply read exactly the data of the failed process.…”

Section: Introduction (mentioning)

confidence: 99%

ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms

Hespe, Hübner, Sanders et al. 2022

Preprint

Fault-tolerant distributed applications require mechanisms to recover data lost via a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure. Therefore, applications have to continue working with the remaining resources. This requires redistributing the workload and that the non-failed processes reload the lost data. We present an algorithmic framework and its C++ library implementation ReStore for MPI programs that enables recovery of lost data after (a) process failure(s). By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. As the application developer can specify which data to load, we also support shrinking recovery instead of recovery using spare compute nodes. We evaluate ReStore in both controlled, isolated environments and real applications. Our experiments show loading times of lost input data in the range of milliseconds on up to 24 576 processors and a substantial speedup of the recovery time for the fault-tolerant version of a widely used bioinformatics application.

“…In order to have an asynchronous fault-tolerant application, we have used the GPI In-memory checkpoint library [4], in order to use a checkpoint/restart based methodology, saving the state of the execution at certain points of it, to be able to recover that state in case of failure. The application needs to decide when it is more reasonable to perform a checkpoint.…”

Section: Mixed MPI/GPI-2 (mentioning)

confidence: 99%

Performance Evaluation of an Algorithm-based Asynchronous Checkpoint-Restart Fault Tolerant Application Using Mixed MPI/GPI-2

Bazaga, Pitoňák 2018

Preprint

One of the hardest challenges of the current Big Data landscape is the lack of ability to process huge volumes of information in an acceptable time. The goal of this work is to ascertain if it is useful to use typical Big Data tools to solve High Performance Computing problems, by exploring and comparing a distributed computing framework implemented on a commodity cluster architecture: the experiment will depend on the computational time required using tools such as Apache Spark. This will be compared to "equivalent more traditional" approaches such as using a distributed memory model with MPI on a distributed file system such as HDFS (Hadoop Distributed File System) and native C libraries that create an interface to encapsulate this file system's functionalities, and using the GPI-2 implementation for the GASPI protocol and its in-memory checkpointing library to provide an application with Fault Tolerance features. To be more precise, we've chosen the K-means algorithm as the experiment, which will be run on variable size datasets, and then we will compare the computational run time and time resilience of both approaches.

CCS CONCEPTS: Computer systems organization → Dependable and fault-tolerant systems and networks
