2022
DOI: 10.1109/tcc.2020.3040312

Efficient Online Scheduling for Coflow-Aware Machine Learning Clusters

Cited by 14 publications (4 citation statements)
References 37 publications
“…Specifically, it adopts a random forest model to predict job resource requirements and then uses the best-fit algorithm and grouping genetic algorithm to optimize the execution performance of DL jobs. Parrot [88] is a framework to manage network bandwidth contention among training jobs using the PS architecture. The communication scheme in a PS workload exhibits a coflow chain dependency where the event of parameter-pull happens after the event of parameter-push.…”
Section: Heterogeneous (mentioning)
confidence: 99%
“…However, in practice, it is not uncommon to further improve SRPT's fairness by some starvation mitigation mechanism. For example, Mangharam et al. observe that SRPT may cause unfairness in multimedia transmission [22], and some SRPT-based schedulers have starvation mitigation mechanisms [22,27,10,21,11,23].…”
Section: Motivation: the Need To Mitigate Starvation For SRPT In Prac... (mentioning)
confidence: 99%
“…In order to achieve linear scale-out, literature focuses on modifying the all-reduce architecture to hierarchical all-reduce [40], and developing network-based systems such as in-network aggregation to overcome communication bottlenecks [18], [19], flow schedulers [17] or tailored topologies [22], [23]. These works revolve around network traffic of DML but focus on point solutions for specific frameworks or communication patterns.…”
Section: Related Work (mentioning)
confidence: 99%
“…Motivated by these challenges, the networking community has started making great efforts to support DML workloads. Proposed solutions range from tailored flow scheduling [16], [17], via in-network aggregation of gradients leveraging programmable data-planes [18]-[20], to specifically crafted high-bandwidth topologies [21]-[23]. While the variety of (dynamic) communication patterns suggests heterogeneity of the resulting network traffic, the network-level improvements assume one particular pattern, e.g., parameter-server [23] or ring-reduce [21].…”
Section: Introduction (mentioning)
confidence: 99%