The increasing demand for cloud-based inference services requires the use of Graphics Processing Unit (GPU). It is highly desirable to utilize GPU eciently by multiplexing dierent inference tasks on the GPU. Batched processing, CUDA streams and Multi-process-service (MPS) help. However, we nd that these are not adequate for achieving scalability by eciently utilizing GPUs, and do not guarantee predictable performance. GSLICE addresses these challenges by incorporating a dynamic GPU resource allocation and management framework to maximize performance and resource utilization. We virtualize the GPU by apportioning the GPU resources across dierent Inference Functions (IFs), thus providing isolation and guaranteeing performance. We develop self-learning and adaptive GPU resource allocation and batching schemes that account for network trac characteristics, while also keeping inference latencies below service level objectives. GSLICE adapts quickly to the streaming data's workload intensity and the variability of GPU processing costs. GSLICE provides scalability of the GPU for IF processing through ecient and controlled spatial multiplexing, coupled with a GPU resource reallocation scheme with near-zero (< 100`s) downtime. Compared to default MPS and TensorRT, GSLICE improves GPU utilization eciency by 60-800% and achieves 2-13⇥ improvement in aggregate throughput. CCS CONCEPTS • Computer systems organization ! Cloud computing; • Computing methodologies ! Machine learning.
GPUs are used for training, inference, and tuning the machine learning models. However, Deep Neural Network (DNN) models vary widely in their ability to exploit the full power of high-performance GPUs. Spatial sharing of GPU enables multiplexing several applications on the GPU and can improve utilization of the GPU, thus improving throughput and lowering latency. DNN models given just the right amount of GPU resources can still provide low inference latency, just as much as dedicating all of the GPU for their inference task. An approach to improve DNN inference performance is hardware-specific tuning of the DNN model. Autotuning frameworks find the optimal low-level implementation for a certain target device based on the trained machine learning model, thus reducing the DNN's inference latency and increasing inference throughput. We observe an inter-dependency between the tuned model and its inference latency. A DNN model tuned with specific GPU resources provides the best inference latency when inferred with close to the same amount of GPU resources. However, a model tuned with the maximum amount of the GPU's resources has poorer inference latency once the GPU resources are limited for inference. On the other hand, a model tuned with an appropriate amount of GPU resources still achieves good inference latency across a wide range of GPU resource availability. We explore the underlying causes that impact the tuning of a model at different amounts of GPU resources. We present a number of techniques to maximize resource utilization and improve tuning performance. We enable controlled spatial sharing of GPU to multiplex several tuning applications on the GPU. We scale the tuning server instances and shard the tuning model across multiple client instances for concurrent tuning of different operators of a model, achieving better GPU multiplexing. With our improvements, we decrease DNN autotuning time by upto 75% and increase throughput by a factor of 5.Preprint. Under review.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.