CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters

Maurya, Avinash; Nicolae, Bogdan; Guliani, Ishan; Rafique, M. Mustafa

doi:10.1109/ds-rt50469.2020.9213578

Cited by 7 publications

(4 citation statements)

References 21 publications


“…Currently, such techniques use local storage independently on each compute node via a single shared link, but can be complemented to leverage local storage of remote nodes. Additionally, checkpoint-restart techniques are also used for accommodating on-demand jobs with batch jobs [4], [5] and workload migration [6], [7]…”

Section: Related Work (mentioning)

confidence: 99%

Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing

Maurya, Nicolae, Rafique et al. 2021

2021 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)

Self Cite


Efficient checkpointing of distributed data structures periodically at key moments during runtime is a recurring fundamental pattern in a large number of use cases: fault tolerance based on checkpoint-restart, in-situ or post-analytics, reproducibility, adjoint computations, etc. In this context, multilevel checkpointing is a popular technique: distributed processes can write their shard of the data independently to fast local storage tiers, then flush asynchronously to a shared, slower tier of higher capacity. However, given the limited capacity of fast tiers (e.g. GPU memory) and the increasing checkpoint frequency, the processes often run out of space and need to fall back to blocking writes to the slow tiers. To mitigate this problem, compression is often applied in order to reduce the checkpoint sizes. Unfortunately, this reduction is not uniform: some processes will have spare capacity left on the fast tiers, while others still run out of space. In this paper, we study the problem of how to leverage this imbalance in order to reduce I/O overheads for multilevel checkpointing. To this end, we solve an optimization problem of how much data to send from each process that runs out of space to the processes that have spare capacity in order to minimize the amount of time spent blocking in I/O. We propose two algorithms: one based on a greedy approach and the other based on modified minimum cost flows. We evaluate our proposal using synthetic and real-life application traces. Our evaluation shows that both algorithms achieve significant improvements in checkpoint performance over traditional multilevel checkpointing.


“…In the realm of executing on-demand jobs and rigid jobs on HPC systems, several groups have proposed to statically or dynamically reserve resources for on-demand requests. Dynamical reservation was achieved by predicting the on-demand request patterns [5], [6]. In terms of accommodating malleable jobs and rigid jobs on HPC systems, several attempts have been made to shrink malleable jobs in order to reduce resource fragmentation problems [8], [12], [13]…”

Section: B. Job Scheduling in HPC (mentioning)

confidence: 99%

“…Research on co-scheduling rigid and on-demand applications often aims at the high responsiveness of on-demand jobs. The common strategies include predicting on-demand jobs' requests, reserving resources for on-demand jobs, and preempting rigid jobs to make room for on-demand jobs [5], [6]. Other studies focus on co-scheduling malleable jobs with rigid jobs on HPC systems [7]-[13]…”

Section: Introduction (mentioning)

confidence: 99%

Hybrid Workload Scheduling on HPC Systems

Yang, Rich, Allcock et al. 2021

Preprint


Traditionally, on-demand, rigid, and malleable applications have been scheduled and executed on separate systems. The ever-growing workload demands and rapidly developing HPC infrastructure trigger the interest of converging these applications on a single HPC system. Although allocating the hybrid workloads within one system could potentially improve system efficiency, it is difficult to balance the tradeoff between the responsiveness of on-demand requests, the incentive for malleable jobs, and the performance of rigid applications. In this study, we present several scheduling mechanisms to address the issues involved in co-scheduling on-demand, rigid, and malleable jobs on a single HPC system. We extensively evaluate and compare their performance under various configurations and workloads. Our experimental results show that our proposed mechanisms are capable of serving on-demand workloads with minimal delay, offering incentives for declaring malleability, and improving system performance.


“…Furthermore, workflows that run multiple DL training instances and/or integrate them with other tasks are becoming increasingly complex, causing unexpected events that mimic the impact of failures. For example, if a high-priority, on-demand task needs to be started immediately, then some of the workers of the data-parallel training may need to be killed in a timely fashion, leaving very narrow room to react [4]…”

Section: Introduction (mentioning)

confidence: 99%

Towards Low-Overhead Resilience for Data Parallel Deep Learning

Nicolae, Hobson, Yildiz et al. 2022

2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid)

Self Cite


Data-parallel techniques have been widely adopted both in academia and industry as a tool to enable scalable training of deep learning models. At scale, DL training jobs can fail due to software or hardware bugs, may need to be preempted or terminated due to unexpected events, or may perform suboptimally because they were misconfigured. Under such circumstances, there is a need to recover and/or reconfigure data-parallel DL training jobs on-the-fly, while minimizing the impact on the accuracy of the DNN model and the runtime overhead. In this regard, state-of-the-art techniques adopted by the HPC community mostly rely on checkpoint-restart, which inevitably leads to loss of progress, thus increasing the runtime overhead. In this paper we explore alternative techniques that exploit the properties of modern deep learning frameworks (overlapping of gradient averaging and weight updates with local gradient computations through pipeline parallelism) to reduce the overhead of resilience/elasticity. To this end we introduce a failure simulation framework and two resilience strategies (immediate mini-batch rollback and lossy forward recovery), which we compare with checkpoint-restart approaches in a variety of settings in order to understand the trade-offs between the accuracy loss of the DNN model and the runtime overhead.
