2019
DOI: 10.1007/s11227-019-02857-y

Job migration in HPC clusters by means of checkpoint/restart

Abstract: Until now, jobs running on HPC clusters were tied to the node where their execution started. We have removed that limitation by integrating a user level Checkpoint/Restart library into a Resource Manager, fully transparent to both the user and running application. This opens the door to a whole new set of tools and scheduling possibilities based on the fact that jobs can be migrated, checkpointed and restarted on a different place or in a different moment, while providing fault-tolerance for every job running …

Cited by 13 publications (12 citation statements)
References 40 publications

“…Niu et al [30] have also shown an example of how checkpointing along with preemptive scheduling can increase the performance of the backfill algorithm, and in this work we follow this approach in practice. To the authors' knowledge, aside from [31], this has never been done before.…”
Section: B Scheduling Methods In Slurm (mentioning)
confidence: 98%
“…A recent development of the SLURM scheduler is the seamless incorporation of the Distributed MultiThreaded Checkpointing (DMTCP) [13] library, enabling it to transparently Checkpoint and Restart (C/R) a single-host, parallel or distributed computation. It does so in user-space, with no modifications to user code or the operating system, supporting a variety of HPC languages and infrastructures, including MPI and OpenMP [31].…”
Section: The Optimized Memoryless Fair-share (mentioning)
confidence: 99%
“…Currently, such techniques use local storage independently on each compute node via a single shared link, but can be complemented to leverage local storage of remote nodes. Additionally, checkpoint-restart techniques are also used for accommodating on-demand jobs with batch jobs [4], [5] and workload migration [6], [7].…”
Section: Related Work (mentioning)
confidence: 99%