Junguk Cho scite author profile

Junguk Cho

2 Publications

2 Citation Statements Received

10 Citation Statements Given


Affiliations

Hewlett-Packard (United States)

Publications


Spatial Sharing of GPU for Autotuning DNN models

Dhakal, Cho, Kulkarni et al. 2020

Preprint


GPUs are used for training, inference, and tuning of machine learning models. However, Deep Neural Network (DNN) models vary widely in their ability to exploit the full power of high-performance GPUs. Spatial sharing of a GPU enables multiplexing several applications on the GPU and can improve GPU utilization, thus improving throughput and lowering latency. DNN models given just the right amount of GPU resources can still provide inference latency as low as when the entire GPU is dedicated to their inference task. One approach to improving DNN inference performance is hardware-specific tuning of the DNN model. Autotuning frameworks find the optimal low-level implementation for a given target device based on the trained machine learning model, thus reducing the DNN's inference latency and increasing inference throughput. We observe an inter-dependency between the tuned model and its inference latency. A DNN model tuned with specific GPU resources provides the best inference latency when inference runs with close to the same amount of GPU resources. However, a model tuned with the maximum amount of the GPU's resources has poorer inference latency once the GPU resources are limited for inference. On the other hand, a model tuned with an appropriate amount of GPU resources still achieves good inference latency across a wide range of GPU resource availability. We explore the underlying causes that impact the tuning of a model at different amounts of GPU resources. We present a number of techniques to maximize resource utilization and improve tuning performance. We enable controlled spatial sharing of the GPU to multiplex several tuning applications on the GPU. We scale the tuning server instances and shard the tuning model across multiple client instances for concurrent tuning of different operators of a model, achieving better GPU multiplexing. With our improvements, we decrease DNN autotuning time by up to 75% and increase throughput by a factor of 5.

Preprint. Under review.


Slice-Tune

Dhakal, Ramakrishnan, Kulkarni et al. 2022
