2020 IEEE/ACM 24th International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2020
DOI: 10.1109/ds-rt50469.2020.9213578

CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters

Abstract: The increasing scale and complexity of scientific applications are rapidly transforming the ecosystem of tools, methods, and workflows adopted by the high-performance computing (HPC) community. Big data analytics and deep learning are gaining traction as essential components in this ecosystem in a variety of scenarios, such as steering of experimental instruments, acceleration of high-fidelity simulations through surrogate computations, and guided ensemble searches. In this context, the batch job model tradit…

Cited by 7 publications (4 citation statements)
References 21 publications
“…Currently, such techniques use local storage independently on each compute node via a single shared link, but can be complemented to leverage local storage of remote nodes. Additionally, checkpoint-restart techniques are also used for accommodating on-demand jobs with batch jobs [4], [5] and workload migration [6], [7].…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…In the realm of executing on-demand jobs and rigid jobs on HPC systems, several groups have proposed to statically or dynamically reserve resources for on-demand requests. Dynamical reservation was achieved by predicting the on-demand request patterns [5], [6]. In terms of accommodating malleable jobs and rigid jobs on HPC systems, several attempts have been made to shrink malleable jobs in order to reduce resource fragmentation problems [8], [12], [13].…”
Section: B. Job Scheduling in HPC (citation type: mentioning)
confidence: 99%
“…Research on co-scheduling rigid and on-demand applications often aims at the high responsiveness of on-demand jobs. The common strategies include predicting on-demand jobs' requests, reserving resources for on-demand jobs, and preempting rigid jobs to make room for on-demand jobs [5], [6]. Other studies focus on co-scheduling malleable jobs with rigid jobs on HPC systems [7]-[13].…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…Furthermore, workflows that run multiple DL training instances and/or integrate them with other tasks are becoming increasingly complex, causing unexpected events that mimic the impact of failures. For example, if a high-priority, on-demand task needs to be started immediately, then some of the workers of the data-parallel training may need to be killed in a timely fashion, leaving very narrow room to react [4].…”
Section: Introduction (citation type: mentioning)
confidence: 99%