2013
DOI: 10.1007/978-3-642-35867-8_3

Employing Checkpoint to Improve Job Scheduling in Large-Scale Systems

Abstract: The FCFS-based backfill algorithm is widely used in scheduling high-performance computer systems. The algorithm relies on runtime estimate of jobs which is provided by users. However, statistics show the accuracy of user-provided estimate is poor. Users are very likely to provide a much longer runtime estimate than its real execution time. In this paper, we propose an aggressive backfilling approach with checkpoint based preemption to address the inaccuracy in user-provided runtime estimate. The appr…

Cited by 12 publications (9 citation statements)
References 24 publications
“…Both FCFS and backfill have been studied thoroughly and while being relatively simple, they are used extensively in current supercomputers' job scheduling systems. The main reason is that practical limitations prevent the use of other scheduling algorithms [30]. In particular, FCFS-backfill relies on users' estimates for jobs' run-time, which have been proven to be highly inaccurate [19], [26], [30].…”
Section: B Scheduling Methods In Slurm (mentioning)
confidence: 99%
“…This can also be triggered by the HTC administrator. We do not consider here such cases as task suspension (execution starvation) or task checkpointing and migration [13] as these do not affect the execution of the other replicas.…”
Section: Htc-sim (mentioning)
confidence: 99%
“…User provided estimates have, however, been widely criticised by the scheduling community for their inaccuracy [24], [25]. Niu et al [26] analyse the traces of four large-scale systems from the Parallel Workloads Archive [27] finding only 17% of jobs completed within 90-110% of their estimate.…”
Section: Duration Prediction (mentioning)
confidence: 99%
“…However, user estimates of job execution time have been shown to be unreliable [24], [25], [26]. We evaluate three estimation policies: Perfect: Perfect a priori knowledge of job duration.…”
Section: B Execution Time Estimation (mentioning)
confidence: 99%