2022 IEEE International Performance, Computing, and Communications Conference (IPCCC) 2022
DOI: 10.1109/ipccc55026.2022.9894345

PickyMan: A Preemptive Scheduler for Deep Learning Jobs on GPU Clusters

Abstract: Deep learning (DL) jobs normally run on GPU clusters. Some DL jobs need to be scheduled preemptively to avoid long waiting times. However, preempting a DL job, which consists of suspending and later resuming it, is time-consuming: suspending must first complete training for the current epoch, and resuming must reload the model and the training data. Existing schedulers rarely consider the overhead of preempting jobs; thus, they may preempt jobs with large time loss, increasing the waiting time an…
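The overhead described in the abstract can be illustrated with a minimal sketch. It is not PickyMan's actual algorithm, which the truncated abstract does not spell out. It assumes only what the abstract states: suspending a job costs the time left in its current epoch, and resuming costs the time to reload the model and data. The `Job` fields and the functions `preemption_cost` and `pick_victim` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a running DL job (all times in seconds)."""
    name: str
    epoch_seconds: float      # duration of one training epoch
    elapsed_in_epoch: float   # time already spent in the current epoch
    reload_seconds: float     # time to reload model and training data on resume


def preemption_cost(job: Job) -> float:
    """Estimated time lost by preempting `job` now (assumed cost model)."""
    # Suspending must wait for the current epoch to finish.
    suspend = max(job.epoch_seconds - job.elapsed_in_epoch, 0.0)
    # Resuming must reload the model and the training data.
    return suspend + job.reload_seconds


def pick_victim(jobs: list[Job]) -> Job:
    """Greedily choose the running job with the smallest estimated loss."""
    return min(jobs, key=preemption_cost)


running = [
    Job("a", epoch_seconds=100, elapsed_in_epoch=90, reload_seconds=30),
    Job("b", epoch_seconds=60, elapsed_in_epoch=10, reload_seconds=5),
    Job("c", epoch_seconds=200, elapsed_in_epoch=195, reload_seconds=2),
]
victim = pick_victim(running)
print(victim.name, preemption_cost(victim))
```

Under this assumed cost model, a job that is about to finish its epoch and reloads quickly is cheaper to preempt than a job with a short epoch that has only just started one.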

Cited by 1 publication (1 citation statement)
References 12 publications
“…PickyMan [23] and Lucid [24] address the issues related to preemption overheads. Preemption is allowed by PickyMan, which minimizes it by predicting the execution times using network traffic and historical data, and by greedily choosing the appropriate job to stop, while it is not allowed by Lucid.…”
Section: Related Work (mentioning)
confidence: 99%