2017
DOI: 10.1051/itmconf/20171203030

DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment

Abstract: Deep neural networks (DNNs) have recently yielded strong results on a range of applications. Training these DNNs on a cluster of commodity machines is a promising approach, since training is time-consuming and compute-intensive. Furthermore, putting DNN tasks into containers in clusters would enable broader and easier deployment of DNN-based algorithms. Toward this end, this paper addresses the problem of scheduling DNN tasks in the containerized cluster environment. Efficiently scheduling data-para…

Cited by 2 publications (2 citation statements)
References 8 publications
“…Differently, Pollux [7] models how the goodput (a custom metric encompassing throughput and training efficiency) changes by adding or removing resources in homogeneous GPU-based systems. It profiles each job and dynamically tunes the batch size, learning rate and number of assigned GPUs.…”
Section: Related Work (mentioning)
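
To make the idea of a goodput metric concrete, here is a toy sketch that multiplies a throughput model by a training-efficiency model and picks the GPU count that maximizes their product. Both models, all parameter values (`per_gpu_rate`, `comm_overhead`, `noise_scale`, `base_batch`, `per_gpu_batch`), and the fixed per-GPU batch size are illustrative assumptions. This is not Pollux's actual model, which also tunes the batch size and learning rate.

```python
def throughput(gpus: int, per_gpu_rate: float = 100.0, comm_overhead: float = 0.1) -> float:
    # Toy model (assumption): samples/s grows with GPUs, but communication overhead
    # makes the scaling sublinear.
    return gpus * per_gpu_rate / (1.0 + comm_overhead * (gpus - 1))

def efficiency(batch: float, base_batch: float = 32.0, noise_scale: float = 224.0) -> float:
    # Toy model (assumption): training progress per sample shrinks as the global
    # batch grows beyond base_batch; equals 1.0 at batch == base_batch.
    return (noise_scale + base_batch) / (noise_scale + batch)

def goodput(gpus: int, per_gpu_batch: int = 32) -> float:
    # Goodput here = raw throughput scaled by training efficiency at the global batch size.
    return throughput(gpus) * efficiency(gpus * per_gpu_batch)

# Choose the GPU count (1..16) with the highest modeled goodput.
best = max(range(1, 17), key=goodput)
```

Under these made-up parameters, adding GPUs beyond the optimum still raises raw throughput but lowers goodput, which is the trade-off a goodput-driven scheduler reasons about.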
“…This work aims to optimize the resource selection and scheduling of DL training jobs from the perspective of a CSP running a data center, efficiently selecting resources for the execution of each job to minimize the energy consumption costs while meeting the applications' due dates. To the best of our knowledge, methods available in the literature tackle this problem by proposing (i) simple job scheduling mechanisms, such as Earliest-Deadline-First (EDF) or First-in-First-Out (FIFO) [3]- [6], possibly coupled with effective resource selection algorithms, or (ii) more elaborate heuristics [7]- [9], sometimes even coupling the resource selection and scheduling problem as in our previous works [10], [11]. Albeit achieving good-quality solutions, these approaches only consider the worst-case execution times of DL applications in searching for an optimal schedule.…”
Section: Introduction (mentioning)
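
The passage contrasts simple job-ordering rules such as FIFO and EDF. The sketch below shows the difference on a single resource, assuming non-preemptive execution, all jobs submitted at time 0, and known execution times (for example, worst-case estimates as the passage mentions). The `Job` type, job names, and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Job:
    name: str        # hypothetical job label
    duration: float  # assumed execution time in hours (e.g. a worst-case estimate)
    deadline: float  # due date in hours from t = 0

def run(order: List[Job]) -> List[Tuple[str, float, bool]]:
    """Run jobs back to back on one resource; report (name, finish time, deadline met)."""
    t = 0.0
    out = []
    for job in order:
        t += job.duration
        out.append((job.name, t, t <= job.deadline))
    return out

def fifo(queue: List[Job]) -> List[Tuple[str, float, bool]]:
    # FIFO: execute in submission order.
    return run(list(queue))

def edf(queue: List[Job]) -> List[Tuple[str, float, bool]]:
    # EDF: earliest deadline first; sorted() is stable, so ties keep submission order.
    return run(sorted(queue, key=lambda j: j.deadline))

queue = [Job("a", 4.0, 10.0), Job("b", 1.0, 2.0), Job("c", 2.0, 5.0)]
```

With this queue, FIFO runs the long job `a` first and misses the deadlines of `b` and `c`, while EDF meets all three. Neither rule accounts for resource selection or energy cost, which is the gap the quoted work targets.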