RIFLING: A reinforcement learning‐based GPU scheduler for deep learning research and development platforms

Chen, Zhaoyun

doi:10.1002/spe.3066

Cited by 7 publications

(4 citation statements)

References 28 publications

(60 reference statements)

“…A number of works apply RL to optimize the elastic training policy. Specifically, RIFLING [18] adopts K-means to divide concurrent jobs into several groups based on the computation-communication ratio similarity. The group operation reduces the state space and accelerates the convergence speed of the RL model.…”

Section: Elastic Training (mentioning)

confidence: 99%
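
The grouping step described in the citation statement above can be illustrated with a minimal sketch. It is not RIFLING's implementation: the job names, the per-iteration timings, the helper names `kmeans_1d` and `group_jobs`, and the choice to define the ratio as compute time divided by communication time are all assumptions made for illustration. The sketch runs plain Lloyd's K-means on one scalar feature per job.

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's K-means on scalar values; returns (centroids, labels).

    Centroids are initialized deterministically at evenly spaced positions
    of the sorted values (an illustrative choice, not taken from the paper).
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    n = len(ordered)
    idx = [round(i * (n - 1) / (k - 1)) for i in range(k)] if k > 1 else [0]
    centroids = [ordered[i] for i in idx]
    labels = [0] * n
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def group_jobs(jobs, k):
    """Group jobs by computation-communication ratio.

    `jobs` maps a job name to (compute_seconds, comm_seconds) per iteration;
    both the input shape and the ratio definition are hypothetical.
    """
    names = list(jobs)
    ratios = [jobs[name][0] / jobs[name][1] for name in names]
    _, labels = kmeans_1d(ratios, k)
    groups = {}
    for name, label in zip(names, labels):
        groups.setdefault(label, []).append(name)
    return groups


if __name__ == "__main__":
    # Hypothetical timings: three communication-bound and three compute-bound jobs.
    example_jobs = {
        "job_a": (0.50, 1.00), "job_b": (0.60, 1.00), "job_c": (0.55, 1.00),
        "job_d": (4.00, 1.00), "job_e": (4.20, 1.00), "job_f": (3.90, 1.00),
    }
    for label, members in sorted(group_jobs(example_jobs, k=2).items()):
        print(label, members)
```

A scheduler that decides per group rather than per job sees far fewer distinct states, which is the state-space reduction the statement refers to.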

“…Each stage requires high-grade hardware resources (GPU and other compute systems) to produce and serve production-level DL models [62,71,106,149]. Therefore it becomes prevalent for IT industries [62,149] and research institutes [18,19,71] to set up GPU datacenters to meet their ever-growing DL development demands. A GPU datacenter possesses large amounts of heterogeneous compute resources to host large amounts of DL workloads.…”

Section: Introduction (mentioning)

confidence: 99%

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Gao, Hu, Ye et al., 2022 (preprint)

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU datacenter. An efficient scheduler design for such a GPU datacenter is crucially important to reduce the operational cost and improve resource utilization. However, traditional approaches designed for big data or high performance computing workloads cannot support DL workloads in fully utilizing the GPU resources. Recently, a substantial number of schedulers tailored for DL workloads in GPU datacenters have been proposed. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads from the scheduling objectives and resource consumption features. Finally, we discuss several promising future research directions. A more detailed summary with the surveyed papers and code links can be found at our project website: https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers. CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Machine learning; • Computer systems organization → Cloud computing.

“…Recently, since reinforcement learning has shown good performances in making sequential decisions, it has been applied to solve the resource scheduling problem of the computing cluster [19,20,21]. In this paper, we introduce the advantage of reinforcement learning to the Kubernetes scheduling and propose DRS, a Deep Reinforcement learning based Kubernetes Scheduler.…”

Section: Introduction (mentioning)

confidence: 99%

DRS: A Deep Reinforcement Learning enhanced Kubernetes Scheduler for Microservice-based System

Jian, Xie, Fang et al., 2023 (preprint)

Kubernetes, the most famous container orchestration framework, is now widely used to manage and schedule the resources of microservices in cloud-native distributed applications. However, Kubernetes preferentially schedules microservices to nodes with rich and balanced CPU and memory resources on a single node. The native scheduler of Kubernetes, called Kube-scheduler, may cause resource fragmentation and decrease resource utilization. In this paper, we propose a deep reinforcement learning enhanced Kubernetes scheduler named DRS. To improve resource utilization and reduce load imbalance, we first present the Kubernetes scheduling problem as a Markov decision process and carefully design the state, action, and reward. Then, we design and implement the DRS monitor to perceive six metrics about resource utilization to construct a comprehensive global resource view. Finally, DRS can automatically learn the scheduling policy through interaction with the Kubernetes cluster, without relying on expert knowledge about workload and cluster status. We implement a prototype of DRS in a Kubernetes cluster with five nodes and evaluate its performance. Experimental results highlight that DRS overcomes the shortcomings of Kube-scheduler and achieves the expected scheduling target with three workloads. Compared with Kube-scheduler, DRS brings an improvement of 27.29% in resource utilization and reduces the load imbalance by 2.90× on average, with only 3.27% CPU overhead and 0.648% communication latency.

“…Deep learning models are often prepared using free frameworks such as PyTorch [3], Tensorflow [4], Keras [5], and others. There is also a growing body of work focusing on practical implementation aspects, for example, using reinforcement learning to optimize graphics processing unit (GPU) allocation in deep learning research [6] or visualization of model structures via tools such as NN‐SVG [7], NETRON [8], or TensorBoard in the TensorFlow framework [4]. However, the exchange of models, or just the application of models prepared by a third party, is not straightforward in practice.…”

Section: Introduction (mentioning)

confidence: 99%

DeepPlayer: An open‐source SignalPlant plugin for deep learning inference

et al. 2022

Background and Objective: Machine learning has become a powerful tool in several computation domains. The most progressive way of machine learning, deep learning, has already surpassed several algorithms designed by human experts. It also applies to the field of biomedical signal processing. However, while many experts produce deep learning models, there is no software platform for signal processing, allowing the convenient use of pre-trained deep learning models and evaluating them using any inspected signal. This may also hinder understanding, interpretation, and explanation of results. For these reasons, we designed DeepPlayer. It is a plugin for the free signal processing software SignalPlant. The plugin allows loading deep learning models saved in the Open Neural Network Exchange (ONNX) file format and evaluating them on any given signal.
Methods: The DeepPlayer plugin and its graphical user interface were designed in C# programming language and the .NET framework. We used the inference library OnnxRuntime, which supports graphics card acceleration. The inference is executed in asynchronous tasks for a live preview and evaluation of the signals. Model outputs can be exported back to SignalPlant for further processing, such as peak detection or thresholding.
Results: We developed the DeepPlayer plugin to evaluate deep learning models in SignalPlant. The plugin keeps with SignalPlant's interactive work with signals, such as live preview or easy selection of associated signals. The plugin can load classification or regression models and allows standard pre-processing and post-processing methods. We prepared several deep learning models to test the plugin. Additionally, we provide a tutorial training script that outputs an ONNX format model with correctly set metadata information. These, and the source code of the DeepPlayer plugin, are publicly accessible via GitHub and Google Colab service.
Conclusion: The DeepPlayer plugin allows running deep learning models easily and interactively. Therefore, experts and non-AI experts alike can explore and apply deep learning models for (biomedical) signal processing. Its ease of use and interactivity might also contribute to a better understanding and acceptance of AI methods in biomedicine.
