2021
DOI: 10.1007/978-3-030-88224-2_7

A HPC Co-scheduler with Reinforcement Learning

Abstract: Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructural utilization data is exclusively leveraged by cluster operators, who use it to configure batch schedulers. This task is challenging and increasingly complex due to ever-larger cluster scales and the heterogeneity of modern scientific workflows. As a result, HPC systems achieve low utilization with long job completion times (makespans). To tackle these challeng…

Cited by 4 publications (3 citation statements)
References 40 publications
“…Moreover, RL are widely adopted in resource management and job scheduling. [19][20][21][22][23][24][25] They focus on the general scenarios like HPC clusters and datacenters where the workloads are diverse. The workloads are always characterized by the resource requirements, including memory, CPU, bandwidth, storage.…”
Section: Discussion

“…In addition, research 36 shows that an RL‐based approach can outperform other hand‐crafted methods in the network architecture search (NAS) field, while researches 15‐18 proved that RL techniques can be applied to solve additional problems related to communication optimization and network research. Moreover, RL are widely adopted in resource management and job scheduling 19‐25 . They focus on the general scenarios like HPC clusters and datacenters where the workloads are diverse.…”
Section: Related Work