Proceedings of the 48th International Conference on Parallel Processing 2019
DOI: 10.1145/3337821.3337909

Holistic Slowdown Driven Scheduling and Resource Management for Malleable Jobs

Abstract: In job scheduling, the concept of malleability has been explored for many years. Research shows that malleability improves system performance, but its use in HPC never became widespread. The causes are the difficulty of developing malleable applications and the lack of support for and integration across the different layers of the HPC software stack. In recent years, however, malleability in job scheduling has become more critical because of the increasing complexity of hardware and workloads. In thi…

Cited by 17 publications (21 citation statements)
References 22 publications (29 reference statements)
“…Sarood et al. [20] combine malleability and DVFS to create a scheduling policy that adapts the workload to a strict power budget in over-provisioned systems. Similarly, in preceding research [21], [22], we used malleability and node-sharing techniques to reduce response time, makespan, and energy consumption. The integration of the policies is a promising research direction.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
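
The statement above pairs power-capped scheduling [20] with malleable resizing. As a rough illustration of the idea only (not the cited algorithm), the sketch below gives every malleable job its minimum node count and then expands jobs greedily while a cluster-wide power cap holds; Job, POWER_BUDGET, and NODE_POWER_WATTS are all assumed names and values.

    # Illustrative sketch only (assumed names/values, not the policy from [20]):
    # give every malleable job its minimum node count, then expand jobs greedily
    # while the cluster-wide power cap still holds at a fixed DVFS level.
    from dataclasses import dataclass

    POWER_BUDGET = 10_000    # assumed cluster power cap, watts
    NODE_POWER_WATTS = 350   # assumed per-node draw at the chosen DVFS level

    @dataclass
    class Job:
        name: str
        min_nodes: int       # malleable lower bound
        max_nodes: int       # malleable upper bound

    def allocate(jobs):
        """Greedy power-capped allocation: minimums first, then expansion."""
        alloc = {j.name: j.min_nodes for j in jobs}
        used = sum(alloc.values()) * NODE_POWER_WATTS
        for j in jobs:
            while alloc[j.name] < j.max_nodes and used + NODE_POWER_WATTS <= POWER_BUDGET:
                alloc[j.name] += 1
                used += NODE_POWER_WATTS
        return alloc

    print(allocate([Job("A", 2, 8), Job("B", 4, 16)]))
    # -> {'A': 8, 'B': 16}: 24 nodes * 350 W = 8400 W, under the 10 kW cap
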
“…Users are still in charge of specifying the number of requested nodes and processes. The number of threads or processes per node can be configured more intelligently by using Slurm plugins or preceding work [21], [22]. If a memory requirement is specified and some nodes cannot satisfy it, a new number of nodes will be calculated.…”
Section: User Interface for Job Submission (citation type: mentioning; confidence: 99%)
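
The memory-driven recalculation described above can be pictured with a minimal sketch. The helper adjust_nodes and its GiB parameters are hypothetical, not the cited system's API: it simply raises the requested node count until the combined per-node memory covers the job's requirement.

    # Hypothetical helper (adjust_nodes is not the cited system's API): raise the
    # requested node count until the combined per-node memory covers the job's need.
    import math

    def adjust_nodes(requested_nodes, total_mem_gib, mem_per_node_gib):
        """Smallest node count >= requested_nodes whose memory covers the job."""
        needed = math.ceil(total_mem_gib / mem_per_node_gib)
        return max(requested_nodes, needed)

    # A job asking for 4 nodes but needing 1024 GiB on 128-GiB nodes gets 8 nodes.
    print(adjust_nodes(4, 1024, 128))  # -> 8
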
“…For HPC workloads, static co-location [7,17,52,54] has been proposed to reduce application interference. With static co-location, system resources are statically partitioned, and each application runs confined in a specific partition.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
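
One way to picture static co-location on a single node is CPU-set partitioning, sketched below under stated assumptions: the partition map is invented for illustration, the call is Linux-only, and real systems also partition memory, caches, and bandwidth, not just cores.

    # Minimal single-node sketch of static co-location (Linux-only; the partition
    # map is an assumption): each application is pinned to a disjoint CPU set, so
    # co-located processes cannot interfere on cores.
    import os

    PARTITIONS = {
        "app_a": {0, 1, 2, 3},   # cores reserved for application A
        "app_b": {4, 5, 6, 7},   # cores reserved for application B
    }

    def confine(partition):
        """Pin the calling process (pid 0 = self) to its static partition."""
        os.sched_setaffinity(0, PARTITIONS[partition])

    confine("app_a")
    print("running on CPUs:", sorted(os.sched_getaffinity(0)))
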
“…All approaches to improve node-level efficiency are translated into cluster-level improvements by fitting batch schedulers like SLURM [53] with job co-scheduling capabilities that consider how node-level resources are shared [17,23,34,52,54,55]. The co-scheduling problem is related to node-level efficiency: better alternatives to in-node resource sharing can be combined with existing scheduling policies to deliver better cluster-level efficiency.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
“…In practice, production RJMS systems, such as Slurm [7] and Torque [8], support static resource allocation (SRA), where the number of allocated resources is defined before job execution and cannot subsequently be changed. Certain research efforts have extended production RJMS to support malleability [9,10].…”
Section: Introduction (citation type: mentioning; confidence: 99%)