Proceedings of the 11th ACM Symposium on Cloud Computing 2020
DOI: 10.1145/3419111.3421307

Elastic parameter server load distribution in deep learning clusters

Abstract: In distributed DNN training, parameter servers (PS) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (aka load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and big potential for training acceleration with better PS l…
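The imbalance the abstract refers to can be pictured with a minimal sketch, not taken from the paper: a static round-robin assignment of parameter tensors to PS nodes, where a single large (hypothetical) embedding layer leaves one PS carrying most of the communication load.

```python
# Minimal sketch of static parameter-to-PS assignment (hypothetical layer
# sizes, not the paper's method): round-robin sharding can leave one PS
# holding far more bytes than the others, turning it into a straggler.

def round_robin_assign(param_sizes, num_servers):
    """Assign each parameter tensor to a PS in round-robin order."""
    load = [0] * num_servers
    assignment = {}
    for i, (name, size) in enumerate(param_sizes.items()):
        ps = i % num_servers
        assignment[name] = ps
        load[ps] += size
    return assignment, load

# Hypothetical per-layer sizes in MB; the embedding table dominates.
params = {"embedding": 1200, "fc1": 64, "fc2": 64, "conv1": 2, "conv2": 4}

assignment, load = round_robin_assign(params, num_servers=3)
print(load)  # -> [1202, 68, 64]: PS 0 serves roughly 18x the bytes of PS 2
```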

Cited by 31 publications (6 citation statements)
References 27 publications
“…In Horus, the over-commitment threshold can be configured based on the number of co-located jobs or device memory usage by DL system operators. Apart from failures, stragglers can be present in the cluster, and an elastic training regime is a practical way of addressing the issue [65]. However, it is not the core focus of this work.…”
Section: System Implementation (mentioning)
confidence: 99%
“…Nowadays, some dynamic parameter assignment methods have been proposed. LAPSE [16] supports allocating parameters dynamically and explores the possibility of employing dynamic parameter allocation in PS. PSLD [17] proposes a prediction-guided exploitation-exploration approach for dynamic PS load distribution and supports dynamic parameter reassignment.…”
Section: Parameter Index and Partition (mentioning)
confidence: 99%
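For illustration only, the kind of dynamic reassignment these systems aim at can be sketched as a greedy rebalancing step; this is a generic heuristic, not LAPSE's mechanism or PSLD's prediction-guided exploitation-exploration policy.

```python
# Generic greedy rebalance sketch (illustrative only, not LAPSE or PSLD):
# repeatedly move a parameter from the most loaded PS to the least loaded
# one, as long as the move does not create a new, larger straggler.

def rebalance(assignment, param_sizes, num_servers, max_moves=10):
    load = [0] * num_servers
    for name, ps in assignment.items():
        load[ps] += param_sizes[name]
    for _ in range(max_moves):
        src = max(range(num_servers), key=load.__getitem__)
        dst = min(range(num_servers), key=load.__getitem__)
        # Largest parameter on the busiest PS whose move lowers that PS's
        # load without pushing the destination above it.
        movable = sorted((n for n, s in assignment.items() if s == src),
                         key=param_sizes.get, reverse=True)
        for name in movable:
            if load[dst] + param_sizes[name] < load[src]:
                assignment[name] = dst
                load[src] -= param_sizes[name]
                load[dst] += param_sizes[name]
                break
        else:
            break  # no improving move left
    return assignment, load
```

On the round-robin example above, such a step barely helps because a single embedding tensor dominates one PS; practical PS frameworks therefore typically also split very large tensors into slices before assigning them.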
“…As the model and dataset sizes have increased for ML training jobs, large-scale distributed training has become increasingly important [1,13,14,22,34,39,41,42,48,68,82,94,117]. In this paper, we focus specifically on data-parallel training, a common approach to distributed training.…”
Section: Case Study: Distributed ML Training (mentioning)
confidence: 99%