Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 2022
DOI: 10.1145/3503222.3507752

VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling

Abstract: Deep learning (DL) models have achieved great success in many application domains. As such, many industrial companies such as Google and Facebook have acknowledged the importance of multi-tenant DL services. Although multi-tenant services have been studied for conventional workloads, they have not been deeply studied for deep learning services, especially on general-purpose hardware. In this work, we systematically analyze the opportunities and challenges of providing multi-tenant deep learning services on the genera…

Cited by 30 publications (3 citation statements)
References: 58 publications
“…The quadratic nature of attention mechanisms has led to a substantial surge in memory consumption for LLMs, thereby magnifying the significance of effective memory management [8,16,58]. Researchers have proposed various algorithmic optimizations aimed at curbing memory consumption, including quantization techniques [22, 24-26, 32, 46, 47, 81, 94], pruning strategies [19-21, 27, 39, 52, 64, 65, 80, 93], and KV-cache compression approaches [5,37], compilation [9, 34, 90-92] and scheduling [11,23,50,51,54,85].…”
Section: Related Work and Discussion (mentioning)
confidence: 99%
“…There are some prior research on optimizing the operator scheduling of DNN models to improve the quality of model service [13], [25], [26], [27]. REEF [25] adopts a parallel mechanism based on dynamic kernel padding to improve the overall throughput.…”
Section: B Operator-level DNN Inference Service (mentioning)
confidence: 99%
“…REEF [25] adopts a parallel mechanism based on dynamic kernel padding to improve the overall throughput. VELTAIR [27] proposed an adaptive operator-level compilation and scheduling to guarantee resource usage efficiency and reduce interference-induced performance loss for multi-tenant DNN services. PREMA [13] is a predictive multi-task scheduling algorithm for preemptible neural processing unit to meet high-throughput.…”
Section: B Operator-level DNN Inference Service (mentioning)
confidence: 99%