Proceedings of the 48th International Conference on Parallel Processing 2019
DOI: 10.1145/3337821.3337909

Holistic Slowdown Driven Scheduling and Resource Management for Malleable Jobs

Abstract: In job scheduling, the concept of malleability has been explored for many years. Research shows that malleability improves system performance, but its use in HPC never became widespread. The causes are the difficulty of developing malleable applications and the lack of support for and integration across the different layers of the HPC software stack. In recent years, however, malleability in job scheduling has become more critical because of the increasing complexity of hardware and workloads. In thi…

Cited by 17 publications (21 citation statements)
References 22 publications (29 reference statements)
“…Sarood et al. [20] combine malleability and DVFS to create a scheduling policy that adapts the workload to a strict power budget in over-provisioned systems. Similarly, in preceding research [21], [22], we used malleability and node-sharing techniques to reduce response time, makespan, and energy consumption. The integration of the policies is a promising research direction.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
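
The statement above pairs power-capped scheduling [20] with malleable resizing. As a rough illustration of the idea only (not the cited algorithm), the sketch below gives every malleable job its minimum node count and then expands jobs greedily while a cluster-wide power cap holds; Job, POWER_BUDGET, and NODE_POWER_WATTS are all assumed names and values.

    # Illustrative sketch only (assumed names/values, not the policy from [20]):
    # give every malleable job its minimum node count, then expand jobs greedily
    # while the cluster-wide power cap still holds at a fixed DVFS level.
    from dataclasses import dataclass

    POWER_BUDGET = 10_000    # assumed cluster power cap, watts
    NODE_POWER_WATTS = 350   # assumed per-node draw at the chosen DVFS level

    @dataclass
    class Job:
        name: str
        min_nodes: int       # malleable lower bound
        max_nodes: int       # malleable upper bound

    def allocate(jobs):
        """Greedy power-capped allocation: minimums first, then expansion."""
        alloc = {j.name: j.min_nodes for j in jobs}
        used = sum(alloc.values()) * NODE_POWER_WATTS
        for j in jobs:
            while alloc[j.name] < j.max_nodes and used + NODE_POWER_WATTS <= POWER_BUDGET:
                alloc[j.name] += 1
                used += NODE_POWER_WATTS
        return alloc

    print(allocate([Job("A", 2, 8), Job("B", 4, 16)]))
    # -> {'A': 8, 'B': 16}: 24 nodes * 350 W = 8400 W, under the 10 kW cap
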
“…Users are still in charge of specifying the number of requested nodes and processes. The number of threads or processes per node can be configured more intelligently by using Slurm plugins or preceding work [21], [22]. If a memory requirement is specified and some nodes cannot satisfy it, a new number of nodes will be calculated.…”
Section: User Interface for Job Submission (citation type: mentioning; confidence: 99%)
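
The memory-driven recalculation described above can be pictured with a minimal sketch. The helper adjust_nodes and its GiB parameters are hypothetical, not the cited system's API: it simply raises the requested node count until the combined per-node memory covers the job's requirement.

    # Hypothetical helper (adjust_nodes is not the cited system's API): raise the
    # requested node count until the combined per-node memory covers the job's need.
    import math

    def adjust_nodes(requested_nodes, total_mem_gib, mem_per_node_gib):
        """Smallest node count >= requested_nodes whose memory covers the job."""
        needed = math.ceil(total_mem_gib / mem_per_node_gib)
        return max(requested_nodes, needed)

    # A job asking for 4 nodes but needing 1024 GiB on 128-GiB nodes gets 8 nodes.
    print(adjust_nodes(4, 1024, 128))  # -> 8
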
“…For HPC workloads, static co-location [7,17,52,54] has been proposed to reduce application interference. With static co-location, system resources are statically partitioned, and each application runs confined in a specific partition.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
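
One way to picture static co-location on a single node is CPU-set partitioning, sketched below under stated assumptions: the partition map is invented for illustration, the call is Linux-only, and real systems also partition memory, caches, and bandwidth, not just cores.

    # Minimal single-node sketch of static co-location (Linux-only; the partition
    # map is an assumption): each application is pinned to a disjoint CPU set, so
    # co-located processes cannot interfere on cores.
    import os

    PARTITIONS = {
        "app_a": {0, 1, 2, 3},   # cores reserved for application A
        "app_b": {4, 5, 6, 7},   # cores reserved for application B
    }

    def confine(partition):
        """Pin the calling process (pid 0 = self) to its static partition."""
        os.sched_setaffinity(0, PARTITIONS[partition])

    confine("app_a")
    print("running on CPUs:", sorted(os.sched_getaffinity(0)))
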
“…All approaches to improve node-level efficiency are translated into cluster-level improvements by fitting batch schedulers like SLURM [53] with job co-scheduling capabilities that consider how node-level resources are shared [17,23,34,52,54,55]. The co-scheduling problem is related to node-level efficiency: better alternatives to in-node resource sharing can be combined with existing scheduling policies to deliver better cluster-level efficiency.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
“…In practice, production RJMS systems, such as Slurm [7] and Torque [8], support static resource allocation (SRA), where the number of allocated resources is defined before job execution and cannot subsequently be changed. Certain research efforts have extended production RJMS to support malleability [9,10].…”
Section: Introduction (citation type: mentioning; confidence: 99%)