2022
DOI: 10.1109/tpds.2021.3064966

Elastic Deep Learning in Multi-Tenant GPU Clusters


Cited by 32 publications (12 citation statements)
References 24 publications
“…Elasticity in DL. Recent work [41-44] has studied how to leverage compute elasticity in related workloads. NumPy-Wren [41] identifies and exploits dynamic parallelism in linear algebra algorithms, including matrix multiplication (key to DL), to increase compute efficiency.…”
Section: Discussion and Related Work
confidence: 99%
“…NumPy-Wren [41] identifies and exploits dynamic parallelism in linear algebra algorithms, including matrix multiplication (key to DL), to increase compute efficiency. One layer higher, [44] and EDL [42] introduce elasticity in DL training workloads, the latter in the context of multi-tenant clusters. TorchElastic [43] similarly provides an interface for defining and executing elastic jobs in a fault-tolerant manner.…”
Section: Discussion and Related Work
confidence: 99%
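As a rough illustration of the elastic-job interfaces this citation statement refers to, the sketch below shows a TorchElastic-style training script that tolerates workers joining or leaving. It is a minimal sketch under stated assumptions, not code from EDL or the cited papers: the checkpoint path, stand-in model, hyperparameters, and the example launch command are illustrative placeholders.

# Minimal sketch (assumptions, not the cited papers' implementation): an
# elastic-friendly PyTorch script of the kind TorchElastic manages, launched e.g. as
#   torchrun --nnodes=1:4 --nproc_per_node=8 --max_restarts=3 train.py
# torchrun re-executes the script whenever the worker set changes, so the script
# re-initializes the process group and resumes from its own periodic checkpoint.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT = "checkpoint.pt"  # illustrative shared checkpoint path

def main():
    dist.init_process_group(backend="nccl")            # rendezvous info comes from torchrun env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)   # stand-in model and optimizer

    start_step = 0
    if os.path.exists(CKPT):                            # resume after a restart or scale event
        state = torch.load(CKPT, map_location=f"cuda:{local_rank}")
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        start_step = state["step"] + 1

    for step in range(start_step, 1000):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()                 # dummy objective for illustration
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 100 == 0 and dist.get_rank() == 0:    # periodic checkpointing is what makes scaling events safe
            torch.save({"model": model.state_dict(), "opt": opt.state_dict(), "step": step}, CKPT)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()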
“…This scheduling goal is to reduce the average queuing and execution time of training workloads in a datacenter. Some advanced strategies with special training configurations (e.g., sharing training, elastic training, heterogeneous training) can help improve the timing efficiency [64, 79, 85, 117, 151-154], which will be elaborated in Sec. 3.2.…”
Section: Efficiency
confidence: 99%
“…Recently, fault-tolerant and elastic capabilities for NN model training on large clusters and in cloud environments have been studied in several works. Wu et al. [47] propose a lightweight coordination layer between the cluster scheduler and the deep learning framework to enable elasticity with a simple API. Ma et al. [48] adopt a mixed-integer programming model to maximize training progress in a real production environment.…”
Section: Fault-tolerant and Elastic Training
confidence: 99%