2024
DOI: 10.1002/cpe.8043

Improving batch schedulers with node stealing for failed jobs

Yishu Du,
Loris Marchal,
Guillaume Pallez
et al.

Abstract: After a machine failure, batch schedulers typically re-schedule the job that failed with a high priority. This is fair for the failed job but still requires that job to re-enter the submission queue and to wait for enough resources to become available. The waiting time can be very long when the job is large and the platform highly loaded, as is the case with typical HPC platforms. We propose another strategy: when a job fails, if no platform node is available, we steal one node from another job, and u…

Cited by 1 publication
References 32 publications