Efficient Online Scheduling for Coflow-Aware Machine Learning Clusters

Li, Wenxin; Chen, Sheng; Li, Keqiu; Qi, Heng; Xu, Renhai; Zhang, Song

doi:10.1109/tcc.2020.3040312

Cited by 14 publications

(4 citation statements)

References 37 publications

“…Specifically, it adopts a random forest model to predict job resource requirements and then uses the best-fit algorithm and grouping genetic algorithm to optimize the execution performance of DL jobs. Parrot [88] is a framework to manage network bandwidth contention among training jobs using the PS architecture. The communication scheme in a PS workload exhibits a coflow chain dependency where the event of parameter-pull happens after the event of parameter-push.…”

Section: Heterogeneous (mentioning)

confidence: 99%

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Gao, Hu, Ye et al. 2022

Preprint

Deep learning (DL) has flourished in a wide variety of fields. Developing a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into GPU datacenters. An efficient scheduler design for such a GPU datacenter is crucial for reducing operational cost and improving resource utilization. However, traditional approaches designed for big data or high-performance computing workloads cannot enable DL workloads to fully utilize the GPU resources. Recently, a substantial number of schedulers tailored to DL workloads in GPU datacenters have been proposed. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads in terms of scheduling objectives and resource consumption features. Finally, we outline several promising future research directions. A more detailed summary with the surveyed papers and code links can be found at our project website: https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers. CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Machine learning; • Computer systems organization → Cloud computing.

“…However, in practice, it is not uncommon to further improve SRPT's fairness by some starvation mitigation mechanism. For example, Mangharam et al. observe that SRPT may cause unfairness in multimedia transmission [22], and some SRPT-based schedulers have starvation mitigation mechanisms [22,27,10,21,11,23].…”

Section: Motivation: the Need To Mitigate Starvation For SRPT In Prac... (mentioning)

confidence: 99%

Online Starvation Mitigation to Balance Average Flow Time and Fairness

Kuo 2021

Preprint

In job scheduling, it is well known that Shortest Remaining Processing Time (SRPT) minimizes the average flow time. However, SRPT may cause starvation and unfairness. To balance fairness and average flow time, one common approach is to minimize the $\ell_2$ norm of flow time. All non-trivial algorithms designed for this problem are offline algorithms based on linear programming rounding. For the online setting, all previous works consider standard scheduling algorithms under the assumptions of speed augmentation or certain input distributions. In their seminal paper, Bansal and Pruhs prove that under speed augmentation, fairness is not sacrificed much when SRPT is used [SICOMP 2010]. However, in practice, to achieve better fairness, it is not uncommon to complement SRPT with some starvation mitigation mechanism. Nonetheless, starvation mitigation inevitably destroys SRPT's optimality in minimizing the average flow time. Thus, it is not clear whether starvation mitigation can improve SRPT's performance on minimizing the $\ell_2$ norm of flow time. In this paper, we answer this question in the affirmative. Let $n$ be the number of jobs. We use an estimate of $n$ to carefully mitigate the starvation caused by SRPT. Given a good estimate of $n$, our starvation mitigation mechanism reduces the competitive ratio of SRPT for the $\ell_2$ norm of flow time from $\Omega(n^{1/2})$ to $O(n^{1/3})$. Finally, we remark that all the online algorithms considered previously for this problem have competitive ratios $\Omega(n^{1/2})$.
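To make the two objectives in this abstract concrete, the following is a minimal sketch of plain preemptive SRPT on a single machine, together with the average flow time and the ℓ2 norm of flow time. It does not implement the paper's starvation mitigation mechanism; the function names (`srpt_flow_times`, `average_flow_time`, `l2_norm_flow_time`) and the job list are hypothetical and chosen only for illustration.

```python
import heapq
import math


def srpt_flow_times(jobs):
    """Simulate preemptive SRPT on one machine (illustrative sketch).

    jobs: list of (release_time, processing_time), processing_time > 0.
    Returns flow times (completion - release) in input order.
    """
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    flow = [0.0] * len(jobs)
    heap = []  # entries are (remaining_time, job_index)
    t = 0.0
    k = 0  # position of the next unreleased job in `order`
    while k < len(order) or heap:
        if not heap:
            # Machine is idle: jump to the next release time.
            t = max(t, jobs[order[k]][0])
        # Release every job that has arrived by time t.
        while k < len(order) and jobs[order[k]][0] <= t:
            i = order[k]
            heapq.heappush(heap, (jobs[i][1], i))
            k += 1
        # Run the job with the shortest remaining time until it
        # finishes or the next job arrives, whichever comes first.
        remaining, i = heapq.heappop(heap)
        next_release = jobs[order[k]][0] if k < len(order) else math.inf
        run = min(remaining, next_release - t)
        t += run
        if run < remaining:
            heapq.heappush(heap, (remaining - run, i))  # preempted
        else:
            flow[i] = t - jobs[i][0]  # completed
    return flow


def average_flow_time(flow):
    return sum(flow) / len(flow)


def l2_norm_flow_time(flow):
    return math.sqrt(sum(f * f for f in flow))


# Hypothetical instance: one long job and two short jobs arriving later.
jobs = [(0, 10), (1, 1), (2, 1)]
flow = srpt_flow_times(jobs)
print(flow, average_flow_time(flow), l2_norm_flow_time(flow))
```

In this instance the long job is preempted by each short arrival, so its flow time dominates the ℓ2 norm even though the average stays small. That gap between the two metrics is what a starvation mitigation mechanism is meant to address.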

“…In order to achieve linear scale-out, the literature focuses on modifying the all-reduce architecture to hierarchical all-reduce [40], and developing network-based systems such as in-network aggregation to overcome communication bottlenecks [18], [19], flow schedulers [17] or tailored topologies [22], [23]. These works revolve around network traffic of DML but focus on point solutions for specific frameworks or communication patterns.…”

Section: Related Work (mentioning)

confidence: 99%

“…Motivated by these challenges, the networking community has started making great efforts to support DML workloads. Proposed solutions range from tailored flow scheduling [16], [17], via in-network aggregation of gradients leveraging programmable data-planes [18]-[20], to specifically crafted high-bandwidth topologies [21]-[23]. While the variety of (dynamic) communication patterns suggests heterogeneity of the resulting network traffic, the network-level improvements assume one particular pattern, e.g., parameter-server [23] or ring-reduce [21].…”

Section: Introduction (mentioning)

confidence: 99%