2021
DOI: 10.1002/spe.3066

RIFLING: A reinforcement learning‐based GPU scheduler for deep learning research and development platforms

Abstract: GPU platforms have been widely adopted in both academia and industry to support deep learning (DL) research and development (R&D). Compared with giant companies that favor custom-designed AI platforms, most small and medium-sized enterprises, institutes, and universities (EIUs) prefer to build or rent a cost-effective GPU cluster, usually of limited scale, to process diverse DL R&D workloads. Therefore, DL scheduling has attracted more attention, with the aim of improving the system efficiency and task …

Cited by 7 publications (4 citation statements)
References 28 publications (60 reference statements)
“…A number of works apply RL to optimize the elastic training policy. Specifically, RIFLING [18] adopts K-means to divide concurrent jobs into several groups based on the computation-communication ratio similarity. The group operation reduces the state space and accelerates the convergence speed of the RL model.…”
Section: Elastic Training (mentioning)
confidence: 99%
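
The grouping step described in this statement can be illustrated with a minimal sketch. It assumes each job is summarized by a single scalar, its computation time divided by its communication time, and clusters those scalars with a plain one-dimensional K-means (Lloyd's algorithm). The names `kmeans_1d` and `group_jobs`, the sample jobs, and the choice of k are hypothetical and are not taken from the RIFLING paper, which may use different features or a different K-means variant.

```python
from typing import Dict, List, Tuple


def kmeans_1d(values: List[float], k: int, iters: int = 100) -> List[int]:
    """Lloyd's algorithm on scalars; returns one cluster label per value."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    # Deterministic initialization: spread the centroids over the sorted values.
    if k == 1:
        centroids = [ordered[len(ordered) // 2]]
    else:
        centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return new_labels


def group_jobs(jobs: Dict[str, Tuple[float, float]], k: int) -> Dict[int, List[str]]:
    """Group jobs by computation-communication ratio (hypothetical helper).

    `jobs` maps a job name to (computation_time, communication_time).
    """
    names = list(jobs)
    ratios = [jobs[n][0] / jobs[n][1] for n in names]
    groups: Dict[int, List[str]] = {}
    for name, label in zip(names, kmeans_1d(ratios, k)):
        groups.setdefault(label, []).append(name)
    return groups


# Illustrative jobs only: two compute-heavy and two communication-heavy.
example_jobs = {"a": (10.0, 1.0), "b": (9.0, 1.0), "c": (1.0, 1.0), "d": (1.2, 1.0)}
example_groups = group_jobs(example_jobs, k=2)
```

With groups like these, a scheduler could make one decision per group rather than per job, which is the sense in which grouping shrinks the state space an RL agent has to explore.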
“…Each stage requires high-grade hardware resources (GPU and other compute systems) to produce and serve production-level DL models [62,71,106,149]. Therefore it becomes prevalent for IT industries [62,149] and research institutes [18,19,71] to set up GPU datacenters to meet their ever-growing DL development demands. A GPU datacenter possesses large amounts of heterogeneous compute resources to host large amounts of DL workloads.…”
Section: Introduction (mentioning)
confidence: 99%
“…Recently, since reinforcement learning has shown good performances in making sequential decisions, it has been applied to solve the resource scheduling problem of the computing cluster [19,20,21]. In this paper, we introduce the advantage of reinforcement learning to the Kubernetes scheduling and propose DRS, a Deep Reinforcement learning based Kubernetes Scheduler.…”
Section: Introduction (mentioning)
confidence: 99%
“…Deep learning models are often prepared using free frameworks such as PyTorch [3], TensorFlow [4], Keras [5], and others. There is also a growing body of work focusing on practical implementation aspects, for example, using reinforcement learning to optimize graphics processing unit (GPU) allocation in deep learning research [6] or visualization of model structures via tools such as NN‐SVG [7], NETRON [8], or TensorBoard in the TensorFlow framework [4]. However, the exchange of models, or just the application of models prepared by a third party, is not straightforward in practice.…”
Section: Introduction (mentioning)
confidence: 99%