2022
DOI: 10.1109/tpds.2021.3064966

Elastic Deep Learning in Multi-Tenant GPU Clusters


Cited by 32 publications (12 citation statements)
References 24 publications
“…Elasticity in DL. Recent work [41-44] has studied how to leverage compute elasticity in related workloads. NumPy-Wren [41] identifies and exploits dynamic parallelism in linear algebra algorithms, including matrix multiplication (key to DL), to increase compute efficiency.…”
Section: Discussion and Related Work
confidence: 99%
“…NumPy-Wren [41] identifies and exploits dynamic parallelism in linear algebra algorithms, including matrix multiplication (key to DL), to increase compute efficiency. One layer higher, [44] and EDL [42] introduce elasticity in DL training workloads, the latter in the context of multi-tenant clusters. TorchElastic [43] similarly provides an interface for defining and executing elastic jobs in a fault-tolerant manner.…”
Section: Discussion and Related Work
confidence: 99%
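As a rough illustration of the elastic-job interfaces this citation statement refers to, the sketch below shows a TorchElastic-style training script that tolerates workers joining or leaving. It is a minimal sketch under stated assumptions, not code from EDL or the cited papers: the checkpoint path, stand-in model, hyperparameters, and the example launch command are illustrative placeholders.

# Minimal sketch (assumptions, not the cited papers' implementation): an
# elastic-friendly PyTorch script of the kind TorchElastic manages, launched e.g. as
#   torchrun --nnodes=1:4 --nproc_per_node=8 --max_restarts=3 train.py
# torchrun re-executes the script whenever the worker set changes, so the script
# re-initializes the process group and resumes from its own periodic checkpoint.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT = "checkpoint.pt"  # illustrative shared checkpoint path

def main():
    dist.init_process_group(backend="nccl")            # rendezvous info comes from torchrun env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)   # stand-in model and optimizer

    start_step = 0
    if os.path.exists(CKPT):                            # resume after a restart or scale event
        state = torch.load(CKPT, map_location=f"cuda:{local_rank}")
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        start_step = state["step"] + 1

    for step in range(start_step, 1000):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()                 # dummy objective for illustration
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 100 == 0 and dist.get_rank() == 0:    # periodic checkpointing is what makes scaling events safe
            torch.save({"model": model.state_dict(), "opt": opt.state_dict(), "step": step}, CKPT)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()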
“…This scheduling goal is to reduce the average queuing and execution time of training workloads in a datacenter. Some advanced strategies with special training configurations (e.g., sharing training, elastic training, heterogeneous training) can help improve the timing efficiency [64, 79, 85, 117, 151-154], which will be elaborated in Sec. 3.2.…”
Section: Efficiency
confidence: 99%
“…Recently, fault-tolerant and elastic capabilities for NN model training on large clusters and in cloud environments have been studied in several works. Wu et al. [47] propose a lightweight coordination layer between the cluster scheduler and the deep learning framework to enable elasticity with a simple API. Ma et al. [48] adopt a mixed-integer programming model to maximize training progress in a real production environment.…”
Section: Fault-tolerant and Elastic Training
confidence: 99%