2019
DOI: 10.48550/arxiv.1901.00041
Preprint

Dynamic Space-Time Scheduling for GPU Inference

Cited by 10 publications (11 citation statements)
References 0 publications
“…This approach is set to improve on the low utilization and poor scaling of unshared access to a GPU. The idea of GPU sharing is promising, as seen in [13], where the authors studied the performance of temporal and spatial GPU sharing, and in [14], which presented a GPU cluster manager enabling GPU sharing for DL jobs.…”
Section: G. Discussion on Resource Sharing (mentioning)
confidence: 99%
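To make the temporal/spatial sharing idea concrete, here is a minimal CUDA sketch, not taken from [13] or [14]: two independent kernels (standing in for two tenants' work) are launched on separate streams, so the hardware may space-share them across SMs when resources allow, or time-slice them otherwise. Kernel names and problem sizes are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Two independent dummy kernels standing in for two tenants' GPU work.
__global__ void tenantA(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

__global__ void tenantB(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] * 0.5f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // One stream per tenant: kernels in different streams may execute
    // concurrently (space-shared across SMs) if resources allow;
    // otherwise the hardware serializes or time-slices them.
    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);

    tenantA<<<(n + 255) / 256, 256, 0, sA>>>(a, n);
    tenantB<<<(n + 255) / 256, 256, 0, sB>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(sA);
    cudaStreamDestroy(sB);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```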
“…where α denotes the weight of the average latency in the objective function. The first constraint in problem (13) ensures that a request from a specific IoT node can be processed by only one edge node. Constraint (4) ensures that the RTT cannot exceed the maximum tolerated latency.…”
Section: E. Problem Formulation (mentioning)
confidence: 99%
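The quoted formulation itself is not reproduced on this page. A plausible shape consistent with the statement, written as a hedged LaTeX sketch, might assume binary assignment variables x_ij (request i served by edge node j); the average-latency term L-bar, the secondary cost term C, and the bound L_max are all hypothetical symbols, with only α's role given by the quote:

```latex
\begin{align}
\min_{x}\quad & \alpha\,\bar{L}(x) + (1-\alpha)\,C(x) \\
\text{s.t.}\quad & \sum_{j} x_{ij} = 1 \quad \forall i
  && \text{(each IoT request served by exactly one edge node)} \\
& \mathrm{RTT}_{ij}\,x_{ij} \le L_{\max} \quad \forall i,j
  && \text{(round-trip time within the tolerated latency)} \\
& x_{ij} \in \{0,1\}
\end{align}
```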
“…However, this approach does not consider the microarchitectural interactions of the NVIDIA scheduling hierarchy, such as the thread block scheduler, which, as we have demonstrated, impact the performance of concurrent workloads. Preliminary work by Jain et al. on deep-learning inference-only workloads suggests that combining spatial and temporal multitasking may outperform either in isolation [12]. We discuss this possibility further in Section 5.…”
Section: Related Work (mentioning)
confidence: 96%
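One way to see why the thread block scheduler matters for concurrent workloads: the number of a kernel's blocks resident per SM caps how much of the GPU it can occupy spatially. The following minimal CUDA sketch, an illustration assumed here rather than taken from [12], queries that bound via the runtime occupancy API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A dummy kernel standing in for an inference workload.
__global__ void inferenceKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Ask the runtime how many blocks of this kernel can be resident
    // on one SM at a 256-thread block size. The thread block scheduler
    // never co-locates more than this, which bounds the kernel's
    // spatial footprint when it runs concurrently with other work.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, inferenceKernel, /*blockSize=*/256,
        /*dynamicSMemSize=*/0);

    printf("Resident blocks per SM: %d (device has %d SMs)\n",
           blocksPerSM, prop.multiProcessorCount);
    return 0;
}
```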
“…We observed that such workloads have fluctuating resource requirements, variable kernel runtimes, sequential kernel launches, and unpredictable arrival times. Previously proposed thread-block-level scheduling policies [2,12,20,25,28,29] focus only on more generic workloads that do not possess these characteristics. Finally, we add to prior understanding of the CUDA scheduling hierarchy and its concurrency mechanisms [3,6,16,23].…”
Section: Introduction (mentioning)
confidence: 99%
“…The typical use of a GPU as a single resource leads to under-utilization of its computing power. This has been shown in many works in the neural-network literature [7] and on well-known benchmarks [12] such as Parboil [13] and Rodinia [4]. Resource under-utilization and unpredictability are further exacerbated as the number of available computing clusters in a GPU increases.…”
Section: Introduction (mentioning)
confidence: 95%