Employing Checkpoint to Improve Job Scheduling in Large-Scale Systems

Niu, Shuangcheng; Zhai, Jidong; Ma, Xiaosong; Liu, Mingliang; Zhai, Yu; Chen, Wenguang; Zheng, Weimin

doi:10.1007/978-3-642-35867-8_3

Cited by 12 publications

(9 citation statements)

References 24 publications


“…Both FCFS and backfill have been studied thoroughly and while being relatively simple, they are used extensively in current supercomputers' job scheduling systems. The main reason is that practical limitations prevent the use of other scheduling algorithms [30]. In particular, FCFS-backfill relies on users' estimates for jobs' run-time, which have been proven to be highly inaccurate [19], [26], [30].…”

Section: B Scheduling Methods In Slurm (mentioning)

confidence: 99%

Optimized Memoryless Fair-Share HPC Resources Scheduling using Transparent Checkpoint-Restart Preemption

Zvi

Oren

2021

Preprint


Common resource management methods in supercomputing systems usually include hard divisions, capping, and quota allotment. Those methods, despite their 'advantages', have some known serious disadvantages including unoptimized utilization of an expensive facility, and occasionally there is still a need to dynamically reschedule and reallocate the resources. Consequently, those methods involve bad supply-and-demand management rather than a free market playground that will eventually increase system utilization and productivity. In this work, we propose the newly Optimized Memoryless Fair-Share HPC Resources Scheduling using Transparent Checkpoint-Restart Preemption, in which the social welfare increases using a free-of-cost interchangeable proprietary possession scheme. Accordingly, we permanently keep the status-quo in regard to the fairness of the resources distribution while maximizing the ability of all users to achieve more CPUs and CPU hours for longer period without any non-straightforward costs, penalties or additional human intervention.


“…This can also be triggered by the HTC administrator. We do not consider here such cases as task suspension (execution starvation) or task checkpointing and migration [13] as these do not affect the execution of the other replicas.…”

Section: HTC-Sim (mentioning)

confidence: 99%

Evaluation of Energy Consumption of Replicated Tasks in a Volunteer Computing Environment

McGough

Forshaw

2018

Companion of the 2018 ACM/SPEC International Conference on Performance Engineering


High Throughput Computing allows workloads of many thousands of tasks to be performed efficiently over many distributed resources and frees the user from the laborious process of managing task deployment, execution and result collection. However, in many cases the High Throughput Computing system is comprised of volunteer computational resources where tasks may be evicted by the owner of the resource. This has two main disadvantages. First, tasks may take longer to run as they may require multiple deployments before finally obtaining enough time on a resource to complete. Second, the wasted computation time will lead to wasted energy. We may be able to reduce the effect of the first disadvantage by submitting multiple replicas of the task and taking the results from the first one to complete. This, though, could lead to a significant increase in energy consumption. Thus we desire to only ever submit the minimum number of replicas required to run the task in the allocated time whilst simultaneously minimising energy. In this work we evaluate the effect of fixed replica counts and Reinforcement Learning on the proportion of tasks which fail to finish in a given time-frame and on the energy consumed by the system.

CCS Concepts: Computing methodologies → Sequential decision making; Simulation evaluation; Hardware → Enterprise level and data centers power issues


“…User provided estimates have, however, been widely criticised by the scheduling community for their inaccuracy [24], [25]. Niu et al [26] analyse the traces of four large-scale systems from the Parallel Workloads Archive [27] finding only 17% of jobs completed within 90-110% of their estimate.…”

Section: Duration Prediction (mentioning)

confidence: 99%
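The accuracy figure quoted above (the share of jobs completing within 90-110% of their estimate) can be illustrated with a minimal sketch. It assumes that "within 90-110%" means the ratio of actual run time to the user's estimate lies between 0.9 and 1.1. The function name `fraction_within_band` and the sample records are hypothetical and are not taken from the cited papers or from the Parallel Workloads Archive.

```python
from typing import Iterable, Tuple


def fraction_within_band(
    jobs: Iterable[Tuple[float, float]],
    low: float = 0.9,
    high: float = 1.1,
) -> float:
    """Return the share of jobs whose actual run time falls within
    [low, high] times the user-provided estimate.

    Each job is an (estimate_seconds, actual_seconds) pair. Records
    with a non-positive estimate are skipped (an assumption about how
    to treat unusable entries).
    """
    total = 0
    hits = 0
    for estimate, actual in jobs:
        if estimate <= 0:
            continue  # no usable estimate for this record
        total += 1
        if low <= actual / estimate <= high:
            hits += 1
    return hits / total if total else 0.0


# Made-up sample records for illustration only.
sample_jobs = [
    (3600, 3500),  # close to the estimate
    (3600, 600),   # heavily overestimated
    (7200, 7100),  # close to the estimate
    (1800, 60),    # heavily overestimated
    (600, 640),    # slightly underestimated
    (3600, 3550),  # close to the estimate
]

if __name__ == "__main__":
    share = fraction_within_band(sample_jobs)
    print(f"{share:.0%} of sample jobs finished within 90-110% of their estimate")
```

Applied to a real trace, the same ratio test would be run over each job's requested and actual run-time fields; which fields those are depends on the trace format and is not specified here.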

“…However, user estimates of job execution time have been shown to be unreliable [24], [25], [26]. We evaluate three estimation policies: Perfect: Perfect a priori knowledge of job duration.…”

Section: B Execution Time Estimation (mentioning)

confidence: 99%

Energy-Aware Simulation of Workflow Execution in High Throughput Computing Systems

McGough

Forshaw

2015

2015 IEEE/ACM 19th International Symposium on Distributed Simulation and Real Time Applications (DS-RT)


Workflows offer great potential for enacting co-related jobs in an automated manner. This is especially desirable when workflows are large or there is a desire to run a workflow multiple times. Much research has been conducted on reducing the makespan of running workflows and maximising the utilisation of the resources they run on, and some existing research investigates how to reduce the energy consumption of workflows on dedicated resources. We extend the HTC-Sim simulation framework to support workflows, allowing us to evaluate the effect of different scheduling strategies on the overheads and energy consumption of workflows run on non-dedicated systems. We evaluate a number of scheduling strategies from the literature in an environment where (workflow) jobs can be evicted by higher priority users.

