2020
DOI: 10.1007/s10586-019-03037-6
Interference-aware parallelization for deep learning workload in GPU cluster

Cited by 19 publications (8 citation statements)
References 21 publications
“…Thus, GPUs can process a large number of data points in parallel, which leads to higher computational throughput. Training deep learning models is a computationally expensive and time-consuming process due to the tremendous volume of data required, and leveraging scalable computation resources can speed up training significantly [15,24]. Recently, research effort has focused on speeding up the training process.…”
Section: Introduction (mentioning)
confidence: 99%
“…Typically, there is no significant performance degradation until the overall CPU utilization of a server approaches the number of physical cores. The reason is that, in the architecture of a physical CPU, the L1/L2 cache is isolated for each physical core [9]. Cache contention occurs when demand exceeds the number of physical cores, which is common since one physical core is typically virtualized into two logical cores for DL developers (through Hyper-Threading) [19].…”
Section: B. Model (mentioning)
confidence: 99%
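For illustration, here is a minimal sketch (assuming the psutil package is available; not code from the cited paper) of the condition described in the statement above: contention becomes likely once aggregate CPU demand approaches the number of physical cores, even though twice as many logical cores are exposed through Hyper-Threading.

```python
# Illustrative sketch only: compare aggregate CPU demand against the number of
# physical cores, the contention threshold described in the quoted statement.
import psutil

physical = psutil.cpu_count(logical=False)  # physical cores, each with private L1/L2 cache
logical = psutil.cpu_count(logical=True)    # logical cores exposed to DL developers

# Average system-wide utilization over one second, expressed in "cores' worth" of work.
busy_cores = psutil.cpu_percent(interval=1.0) / 100.0 * logical

if busy_cores >= physical:
    print(f"CPU demand ({busy_cores:.1f} cores) exceeds {physical} physical cores: "
          "L1/L2 cache contention between co-located jobs is likely.")
else:
    print(f"CPU demand ({busy_cores:.1f} cores) is below {physical} physical cores: "
          "little interference expected.")
```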
“…Some white-box studies build explicit interference models to predict the performance slowdown of co-locations, e.g., for MapReduce tasks [7] or VM tasks with I/O contention [8]. Other, DNN-based approaches use large amounts of historical traces to learn the interference levels of co-located ML jobs [9], or equip the scheduler with a reinforcement learning (RL) model that improves the job placement policy through exploration and feedback [10,11].…”
Section: Introduction (mentioning)
confidence: 99%
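As a rough illustration of the white-box idea mentioned above (an explicit model that predicts co-location slowdown), the sketch below fits a simple linear model from shared-resource utilization to a slowdown factor. The features, coefficients, and data points are hypothetical placeholders, not the models of the cited works.

```python
# Illustrative linear interference model: predict the slowdown of a job from
# the resource utilization it shares with a co-located job.
import numpy as np

# Each row: [combined GPU util, combined PCIe bandwidth util, combined CPU util]
# for a pair of co-located jobs (hypothetical values); target: measured slowdown (>= 1.0).
X = np.array([
    [0.6, 0.3, 0.4],
    [0.9, 0.7, 0.8],
    [1.2, 0.9, 1.1],
    [0.4, 0.2, 0.3],
])
y = np.array([1.05, 1.30, 1.65, 1.02])

# Fit slowdown ~ 1 + w . features with ordinary least squares.
w, *_ = np.linalg.lstsq(X, y - 1.0, rcond=None)

def predict_slowdown(features):
    """Predict the slowdown factor for a candidate co-location."""
    return 1.0 + float(np.dot(w, features))

print(predict_slowdown([0.8, 0.5, 0.6]))
```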
“…In recent years, more parallelized deep learning methods have been proposed [11]. On the algorithmic side, several algorithms have been proposed to accelerate multi-GPU implementations or to make inference more accurate [1,26] and faster [7,12]. Moreover, research has been done to integrate data parallelism (DP) and model parallelism (MP) [8].…”
Section: Multi-GPU Parallel Computing (mentioning)
confidence: 99%
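To make the DP side of the statement above concrete, here is a minimal data-parallelism sketch using PyTorch's nn.DataParallel (assuming PyTorch and CUDA GPUs are available; the model, batch, and hyperparameters are placeholders for illustration). Model parallelism would instead split the layers themselves across devices.

```python
# Minimal data-parallel training step: the batch is split across all visible
# GPUs, each replica computes gradients on its slice, and gradients are summed.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10))

if torch.cuda.is_available():
    model = nn.DataParallel(model).cuda()  # replicate the model on every GPU

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step over a synthetic batch (placeholder data).
inputs = torch.randn(64, 1024)
labels = torch.randint(0, 10, (64,))
if torch.cuda.is_available():
    inputs, labels = inputs.cuda(), labels.cuda()

optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()
```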