Deadline-Aware Offloading for High-Throughput Accelerators

Yeh, Tsung Tai; Sinclair, Matthew; Beckmann, Bradford M.; Rogers, Timothy G.

doi:10.1109/hpca51647.2021.00048

Cited by 10 publications

(1 citation statement)

References 47 publications

“…This enables us to derive a simpler but more effective solution that leverages DNN-specific features (e.g., characteristics of each layer). Several QoS-aware GPU schedulers in a datacenter environment [10], [11] perform task prioritization based on estimated QoS slack time. However, their estimation is less precise in an NPU setting for not utilizing data fetch and compute time, which can be readily estimated on NPUs with high precision.…”

Section: Introduction (mentioning)

confidence: 99%

Layerweaver+: A QoS-Aware Layer-Wise DNN Scheduler for Multi-Tenant Neural Processing Units

Jin, Ham, et al. 2022

IEICE Trans. Inf. & Syst.

Many cloud service providers employ specialized hardware accelerators, called neural processing units (NPUs), to accelerate deep neural networks (DNNs). An NPU scheduler is responsible for scheduling incoming user requests and is required to satisfy two, often conflicting, optimization goals: maximizing system throughput and satisfying quality-of-service (QoS) constraints (e.g., deadlines) of individual requests. We propose Layerweaver+, a low-cost layer-wise DNN scheduler for NPUs, which provides both high system throughput and minimal QoS violations. For a serving scenario based on the industry-standard MLPerf inference benchmark, Layerweaver+ improves system throughput by up to 266.7% over the baseline scheduler serving one DNN at a time.
