Proceedings of the 11th ACM Symposium on Cloud Computing 2020
DOI: 10.1145/3419111.3421307

Elastic parameter server load distribution in deep learning clusters

Abstract: In distributed DNN training, parameter servers (PS) can become performance bottlenecks due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention, or computation interference. Few existing studies have investigated efficient parameter (aka load) distribution among PSs. We observe significant training inefficiency with the current parameter assignment in representative machine learning frameworks (e.g., MXNet, TensorFlow), and big potential for training acceleration with better PS l…
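The imbalance the abstract refers to can be pictured with a minimal sketch, not taken from the paper: a static round-robin assignment of parameter tensors to PS nodes, where a single large (hypothetical) embedding layer leaves one PS carrying most of the communication load.

```python
# Minimal sketch of static parameter-to-PS assignment (hypothetical layer
# sizes, not the paper's method): round-robin sharding can leave one PS
# holding far more bytes than the others, turning it into a straggler.

def round_robin_assign(param_sizes, num_servers):
    """Assign each parameter tensor to a PS in round-robin order."""
    load = [0] * num_servers
    assignment = {}
    for i, (name, size) in enumerate(param_sizes.items()):
        ps = i % num_servers
        assignment[name] = ps
        load[ps] += size
    return assignment, load

# Hypothetical per-layer sizes in MB; the embedding table dominates.
params = {"embedding": 1200, "fc1": 64, "fc2": 64, "conv1": 2, "conv2": 4}

assignment, load = round_robin_assign(params, num_servers=3)
print(load)  # -> [1202, 68, 64]: PS 0 serves roughly 18x the bytes of PS 2
```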

Cited by 31 publications (6 citation statements)
References 27 publications
“…In Horus, the over-commitment threshold can be configured based on the number of co-located jobs or device memory usage by DL system operators. Apart from failures, stragglers can be present in the cluster, and an elastic training regime is a practical way of addressing the issue [65]. However, it is not the core focus of this work.…”
Section: System Implementation (mentioning)
confidence: 99%
“…Nowadays, some dynamic parameter assignment methods have been proposed. LAPSE [16] supports allocating parameters dynamically and explores the possibility of employing dynamic parameter allocation in PS. PSLD [17] proposes a prediction-guided exploitation-exploration approach for dynamic PS load distribution and supports dynamic parameter reassignment.…”
Section: Parameter Index and Partition (mentioning)
confidence: 99%
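For illustration only, the kind of dynamic reassignment these systems aim at can be sketched as a greedy rebalancing step; this is a generic heuristic, not LAPSE's mechanism or PSLD's prediction-guided exploitation-exploration policy.

```python
# Generic greedy rebalance sketch (illustrative only, not LAPSE or PSLD):
# repeatedly move a parameter from the most loaded PS to the least loaded
# one, as long as the move does not create a new, larger straggler.

def rebalance(assignment, param_sizes, num_servers, max_moves=10):
    load = [0] * num_servers
    for name, ps in assignment.items():
        load[ps] += param_sizes[name]
    for _ in range(max_moves):
        src = max(range(num_servers), key=load.__getitem__)
        dst = min(range(num_servers), key=load.__getitem__)
        # Largest parameter on the busiest PS whose move lowers that PS's
        # load without pushing the destination above it.
        movable = sorted((n for n, s in assignment.items() if s == src),
                         key=param_sizes.get, reverse=True)
        for name in movable:
            if load[dst] + param_sizes[name] < load[src]:
                assignment[name] = dst
                load[src] -= param_sizes[name]
                load[dst] += param_sizes[name]
                break
        else:
            break  # no improving move left
    return assignment, load
```

On the round-robin example above, such a step barely helps because a single embedding tensor dominates one PS; practical PS frameworks therefore typically also split very large tensors into slices before assigning them.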
“…As the model and dataset sizes have increased for ML training jobs, large-scale distributed training has become increasingly important [1,13,14,22,34,39,41,42,48,68,82,94,117]. In this paper, we focus specifically on data-parallel training, a common approach to distributed training.…”
Section: Case Study: Distributed ML Training (mentioning)
confidence: 99%