Scheduling Beyond CPUs for HPC

Fan, Yuping; Lan, Zhiling; Rich, Paul M.; Allcock, William E.; Papka, Michael E.; Austin, Brian; Paul, David

doi:10.1145/3307681.3325401

Cited by 33 publications

(14 citation statements)

References 26 publications

“…Further work has demonstrated reconfigurable accelerators that rely on field programmable gate arrays (FPGAs) [40,70] or ASICs [81]. Consequently, past work has examined how job scheduling should consider heterogeneous resource requests [8,30], how the operating system (OS) and runtime should adapt [42,57], how to write applications for heterogeneous systems [8,32], how to partition data-parallel applications onto heterogeneous compute resources [48], how to consider the different fault tolerances of heterogeneous resources [41], how to fairly compare the performance of different heterogeneous systems [44], and what the impact of heterogeneous resources is to application performance [52,74,80].…”

Section: Background and Related Work, 2.1 Resource Heterogeneity in HPC (mentioning)

confidence: 99%

A Case For Intra-rack Resource Disaggregation in HPC

Klenk

Cook

Teh

et al. 2022

ACM Trans. Archit. Code Optim.

The expected halt of traditional technology scaling is motivating increased heterogeneity in high performance computing (HPC) systems with the emergence of numerous specialized accelerators. As heterogeneity increases, so does the risk of underutilizing expensive hardware resources if we preserve today’s rigid node configuration and reservation strategies. This has sparked interest in resource disaggregation to enable finer-grain allocation of hardware resources to applications. However, there is currently no data-driven study of what range of disaggregation is appropriate in HPC. To that end, we perform a detailed analysis of key metrics sampled in NERSC’s Cori, a production HPC system that executes a diverse open-science HPC workload. In addition, we profile a variety of deep learning applications to represent an emerging workload. We show that for a rack (cabinet) configuration and applications similar to Cori, a central processing unit (CPU) with intra-rack disaggregation has a 99.5% probability to find all resources it requires inside its rack. In addition, ideal intra-rack resource disaggregation in Cori could reduce memory and NIC resources by 5.36% to 69.01% and still satisfy the worst-case average rack utilization.

“…Therefore, multi-resource HPC scheduling demands more complicated scheduling methods. Optimization methods, especially multiobjective optimization methods, are leveraged to achieve better system performance in HPC scheduling [10], [11], [21].…”

Section: Multi-resource Scheduling (mentioning)

confidence: 99%

Job Scheduling in High Performance Computing

Fan

2021

Preprint

Self Cite

The ever-growing processing power of supercomputers in recent decades enables us to explore increasingly complex scientific problems. Effectively scheduling these jobs is crucial for individual job performance and system efficiency. The traditional job schedulers in high performance computing (HPC) are simple and concentrate on improving CPU utilization. The emergence of new hardware resources and novel hardware structures imposes severe challenges on traditional schedulers. Increasingly diverse workloads, including compute-intensive and data-intensive applications, require more efficient schedulers. Even worse, these two factors interact with each other, which makes the scheduling problem even more challenging. In recent years, much research has discussed new scheduling methods to combat the problems brought by rapid system changes. In this research study, we have investigated the challenges faced by HPC scheduling and state-of-the-art scheduling methods to overcome these challenges. Furthermore, we propose an intelligent scheduling framework to alleviate the problems encountered in modern job scheduling.

“…The underlying CQSim scheduling simulator has been successfully supporting a number of projects in this field over a decade [8, 19-30]. CQSim provides a unified platform to evaluate the performance of various methods with minimal overheads.…”

Section: Impact (mentioning)

confidence: 99%

“…By identifying these factors, the system administrators could develop new policies to minimize the impacts of these factors. [20,21,23,26] aim to find the scheduling strategies to handle multiple resources, i.e., CPU, burst buffer, GPU, and power. CQSim plays a crucial role in these multi-resource projects, because CQSim simulator provides a virtual configurable platform to identify the best scheduling policy to schedule specific resources on a given system before deployment on real systems.…”

Section: Impact (mentioning)

confidence: 99%

DRAS-CQSim: A reinforcement learning based framework for HPC cluster scheduling

Fan

Lan

2021

Software Impacts

Self Cite

For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high performance computing (HPC) systems. However, increasingly complex HPC systems combined with highly diverse workloads make such a manual process challenging, time-consuming, and error-prone. We present a reinforcement learning based HPC scheduling framework named DRAS-CQSim to automatically learn an optimal scheduling policy. DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, which allows system administrators to quickly obtain customized scheduling policies.
