Efficient Scheduling of Weighted Coflows in Data Centers

Wang, Zhiliang; Zhang, Han; Shi, Xingang; Yin, Xia; Li, Yahui; Geng, Hua; Wu, Qianhong; Liu, Jianwei

doi:10.1109/tpds.2019.2905560

Cited by 27 publications

(6 citation statements)

References 41 publications

“…Unless otherwise specified, we choose the number of ports P = 20, the average job arrival rate λ = 20 and the average number of coflows n = 8. In the experiment, we compare our algorithm with DeepWeave, both based on DRL and Varys [6] and IAOA [33] that were also used as popular non-ML baselines for comparison with DeepWeave [1]. In addition, we add an ablation experiment to compare the performance of the retrained model under the same settings, removing the self-attention layer.…”

Section: Simulation Results (mentioning)

confidence: 99%

A Scalable Deep Reinforcement Learning Model for Online Scheduling Coflows of Multi-Stage Jobs for High Performance Computing

Wang, Shen

2021

Preprint

Coflow is a recently proposed networking abstraction to help improve the communication performance of data-parallel computing jobs. In multi-stage jobs, each job consists of multiple coflows and is represented by a Directed Acyclic Graph (DAG). Efficiently scheduling coflows is critical to improve the data-parallel computing performance in data centers. Compared with hand-tuned scheduling heuristics, the existing work DeepWeave [1] utilizes a Reinforcement Learning (RL) framework to generate highly efficient coflow scheduling policies automatically. It employs a graph neural network (GNN) to encode the job information in a set of embedding vectors, and feeds a flat embedding vector containing the whole job information to the policy network. However, this method has poor scalability as it is unable to cope with jobs represented by DAGs of arbitrary sizes and shapes, which requires a large policy network for processing a high-dimensional embedding vector that is difficult to train. In this paper, we first utilize a directed acyclic graph neural network (DAGNN) to process the input and propose a novel Pipelined-DAGNN, which can effectively speed up the feature extraction process of the DAGNN. Next, we feed the embedding sequence composed of schedulable coflows instead of a flat embedding of all coflows to the policy network, and output a priority sequence, which makes the size of the policy network depend on only the dimension of features instead of the product of dimension and number of nodes in the job's DAG. Furthermore, to improve the accuracy of the priority scheduling policy, we incorporate the Self-Attention Mechanism into a deep RL model to capture the interaction between different parts of the embedding sequence to make the output priority scores relevant. Based on this model, we then develop a coflow scheduling algorithm for online multi-stage jobs. Our simulation results are based on a real Facebook trace. Compared with a state-of-the-art approach, our model can shorten the average weighted job completion time by up to 40.42% and complete jobs at least 1.68 times faster. It also has better scalability and robustness.
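The abstract's claim that the policy network's size depends only on the feature dimension, not on the number of coflows, can be illustrated with a minimal sketch. This is not the authors' model: the single-head scaled dot-product self-attention, the linear scoring layer, the dimension `d = 4`, the random weights, and all function names are assumptions made for illustration.

```python
import math
import random


def matvec(m, v):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_params(d, seed=0):
    """Hypothetical parameters: three d x d projections and one scoring vector."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]

    return {"Wq": mat(), "Wk": mat(), "Wv": mat(),
            "w_out": [rng.uniform(-1, 1) for _ in range(d)]}


def priority_scores(embeddings, params):
    """Return one priority score per schedulable-coflow embedding."""
    d = len(params["w_out"])
    q = [matvec(params["Wq"], e) for e in embeddings]
    k = [matvec(params["Wk"], e) for e in embeddings]
    v = [matvec(params["Wv"], e) for e in embeddings]
    scores = []
    for qi in q:
        # Attention weights of this coflow over every coflow in the sequence.
        attn = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                        for kj in k])
        # Context vector: attention-weighted mix of the value vectors.
        ctx = [sum(w * vj[t] for w, vj in zip(attn, v)) for t in range(d)]
        scores.append(sum(a * b for a, b in zip(params["w_out"], ctx)))
    return scores


def count_params(params):
    """Total number of scalar parameters; independent of sequence length."""
    d = len(params["w_out"])
    return 3 * d * d + d


params = make_params(d=4)
short_seq = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
long_seq = [[0.1 * (i + j) for j in range(4)] for i in range(10)]
print(len(priority_scores(short_seq, params)),
      len(priority_scores(long_seq, params)))
print(count_params(params))
```

The same parameter set scores a sequence of 3 embeddings and a sequence of 10, which is the scalability property the abstract describes; a flat embedding of all coflows would instead need an input layer sized to the largest DAG.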

Section: Simulation Results (mentioning)

confidence: 99%

“…2. IAOA (Information-Agnostic Online Algorithm) [33] formulates the weighted coflow completion time minimization problem and proposes a heuristic solution with an approximation factor of 2 to the optimal solution. However, IAOA did not consider the dependency between coflows in job DAGs 3.…”

Section: Simulation Settings (mentioning)

confidence: 99%
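To make the objective named in this statement concrete, the following sketch computes a total weighted completion time for a given schedule. It is not IAOA: it assumes a deliberately simplified model in which coflows run one after another on a single bottleneck, and the coflow names, weights, and durations are invented for illustration. Ordering by duration divided by weight (Smith's rule) is optimal only for that single-machine simplification.

```python
def total_weighted_completion_time(coflows, order):
    """Sum of weight * completion time when coflows run one after another.

    coflows maps a name to (weight, duration); order is a list of names.
    """
    clock = 0.0
    total = 0.0
    for name in order:
        weight, duration = coflows[name]
        clock += duration  # this coflow finishes at the running clock
        total += weight * clock
    return total


def smith_order(coflows):
    """Order names by duration / weight, ascending (Smith's rule)."""
    return sorted(coflows, key=lambda n: coflows[n][1] / coflows[n][0])


# Hypothetical coflows: name -> (weight, duration on the bottleneck).
coflows = {"A": (1.0, 4.0), "B": (3.0, 2.0), "C": (2.0, 6.0)}
arrival_order = ["A", "B", "C"]
weighted_order = smith_order(coflows)

print(total_weighted_completion_time(coflows, arrival_order))
print(weighted_order, total_weighted_completion_time(coflows, weighted_order))
```

Running the heavier, shorter coflow `B` first lowers the weighted sum relative to arrival order, which is the effect weights are meant to capture.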

A Scalable Deep Reinforcement Learning Model for Online Scheduling Coflows of Multi-Stage Jobs for High Performance Computing

Wang, Shen

2021

Preprint

“…The network topology can be represented by an undirected connected graph [24][25][26] G = (V, E, W). Where V, E, W respectively represents the set of nodes, the set of links, and the set of costs of links in the network, Ē is the complement of E. For link (u, v) ∈ E in the network, w(u, v) represents the cost of the link.…”

Section: Network Model and Problem Description, 2.1 Network Model (mentioning)

confidence: 99%
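The network model in this statement can be sketched as a small data structure. The node names and costs below are invented, and the sketch assumes that the complement of E means the set of node pairs not joined by a link; neither is taken from the cited paper.

```python
from itertools import combinations

# Hypothetical topology: V is the node set, W maps each undirected link to its cost.
V = {"a", "b", "c", "d"}
W = {
    frozenset(("a", "b")): 1,
    frozenset(("b", "c")): 2,
    frozenset(("c", "d")): 1,
}
E = set(W)  # the link set is exactly the set of pairs that have a cost


def cost(u, v):
    """Return w(u, v); the graph is undirected, so argument order is irrelevant."""
    return W[frozenset((u, v))]


def complement_edges(nodes, edges):
    """Return every unordered node pair that is not a link."""
    all_pairs = {frozenset(pair) for pair in combinations(sorted(nodes), 2)}
    return all_pairs - edges


E_bar = complement_edges(V, E)
print(cost("b", "a"), len(E_bar))
```

Using `frozenset` keys keeps `(u, v)` and `(v, u)` as the same link, which matches the undirected graph the statement describes.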

Efficient Routing Protection Algorithm Based on Optimized Network Topology

Geng, Jin, Yao et al. 2022

Computers, Materials & Continua

Self Cite

Network failures are unavoidable and occur frequently. When the network fails, intra-domain routing protocols deploying on the Internet need to undergo a long convergence process. During this period, a large number of messages are discarded, which results in a decline in the user experience and severely affects the quality of service of Internet Service Providers (ISP). Therefore, improving the availability of intra-domain routing is a trending research question to be solved. Industry usually employs routing protection algorithms to improve intra-domain routing availability. However, existing routing protection schemes compute as many backup paths as possible to reduce message loss due to network failures, which increases the cost of the network and impedes the methods deployed in practice. To address the issues, this study proposes an efficient routing protection algorithm based on optimized network topology (ERPBONT). ERPBONT adopts the optimized network topology to calculate a backup path with the minimum path coincidence degree with the shortest path for all source purposes. Firstly, the backup path with the minimum path coincidence with the shortest path is described as an integer programming problem. Then the simulated annealing algorithm ERPBONT is used to find the optimal solution. Finally, the algorithm is tested on the simulated topology and the real topology. The experimental results show that ERPBONT effectively reduces the path coincidence between the shortest path and the backup path, and significantly improves the routing availability.

“…From an another angle, the flaws of coflow have also been discussed recently. In [28], the problem of scheduling weighted coflows is addressed, where weights are used to express the importances of different coflows. Tian et al argue that there are dependencies among coflows in the context of multistage jobs and propose an approximation algorithm [29].…”

Section: B Finding New Situations (mentioning)

confidence: 99%

Application-Oriented Network Scheduling With Metaflow

et al. 2019

Distributed applications usually feature a set of correlated flows between two consecutive computation stages. The scheduling of these flows has a crucial influence on job completion time. Coflow improves performance by optimizing the finish time of the entire set of flows. However, the flows and computing tasks in one application have more complex relationships that exceed the coflow's barrier assumption. In this context, scheduling via coflow abstraction may hurt application performance. Accordingly, we propose metaflow, a traffic abstraction derived from the computation graph of the application. Metaflow reveals the detailed flow requirements of the application and makes it easier to reduce the job completion time. Based on the metaflow, we first develop a mathematical model and formulate the scheduling problem as an integer linear programming (ILP) problem. We further prove that it has an equivalent linear programming (LP) problem through rigorous theoretical analysis in order to solve this ILP problem efficiently. To demonstrate the effectiveness of scheduling with metaflow, we have conducted extensive simulations with both synthetic single jobs and production traces containing multiple jobs. The simulation results verify that our new scheduler adapts well to different jobs and can achieve a significant increase in an average speed of 2.87× on a real-life workload, compared to the state-of-the-art coflow scheduler.

INDEX TERMS: Datacenter networking, distributed applications, network scheduling.
