Improving Fairness in a Large Scale HTC System Through Workload Analysis and Simulation

Azevedo, Frédéric; Klusáček, Dalibor; Suter, Frédéric

doi:10.1007/978-3-030-29400-7_10

Cited by 5 publications

(5 citation statements)

References 14 publications

“…Also users and/or groups are often subject to a upper bound on the amount of resources they can use simultaneously. For this purpose, Alea provides CPU quotas, that guarantee that a user/group will not exceed the corresponding maximum allowed share of resources [2].…”

Section: Detailed System Simulation Capabilities (mentioning)

confidence: 99%

“…Using Alea, we were able to model the system and evaluate new setups for the system's queues and the per-group CPU quotas. This new setup allowed for improved fairness for local users, by better balancing their wait times with the wait times of grid-originating jobs [2].…”

Section: Improving Fairness in Large HTC System (mentioning)

confidence: 99%

“…Since then, many new features have been implemented and the simulator has been successfully used for various purposes, both as a purely research tool as well as when testing new setups and new scheduling policies for production HPC and HTC systems. The main contribution of this paper is that (1) we describe recent improvements in the simulator, that allow for truly complex simulations that involve several detailed setups that correspond to typical real-life based scenarios, (2) we describe the recent speedup of the simulator that enables us to run truly large-scale simulations involving millions of jobs and thousands of nodes that complete in just a few hours, and (3) we provide several real-life based case studies where Alea has been used to develop and evaluate effects of major modifications of real HPC and HTC systems.…”

Section: Introduction (mentioning)

confidence: 99%

Alea – Complex Job Scheduling Simulator

Klusáček, Soysal, Suter

2020

Lecture Notes in Computer Science

Self Cite

Using large computer systems such as HPC clusters up to their full potential can be hard. Many problems and inefficiencies relate to the interactions of user workloads and system-level policies. These policies enable various setup choices of the resource management system (RMS) as well as the applied scheduling policy. While expert's assessment and well known best practices do their job when tuning the performance, there is usually plenty of room for further improvements, e.g., by considering more efficient system setups or even radically new scheduling policies. For such potentially damaging modifications it is very suitable to use some form of a simulator first, which allows for repeated evaluations of various setups in a fully controlled manner. This paper presents the latest improvements and advanced simulation capabilities of the Alea job scheduling simulator that has been actively developed for over 10 years now. We present both recently added advanced simulation capabilities as well as a set of real-life based case studies where Alea has been used to evaluate major modifications of real HPC and HTC systems.

“…In a previous study we showed that two distinct sub-workloads are executed at CC-IN2P3 [2]. Some jobs are submitted by a small number of large user groups through a Grid middleware, at a nearly constant rate and with an important upstream control of the submissions while Local users from about 60 different groups directly submit their jobs to the batch system.…”

mentioning

confidence: 99%

“…This workload is composed of 7,749,500 Grid jobs and 5,748,922 Local jobs, for a total of 13,498,422 jobs. Hereafter we focus only on the Local jobs, as they experience larger wait times than Grid jobs [2].…”

mentioning

confidence: 99%