Proceedings of the 48th International Conference on Parallel Processing 2019
DOI: 10.1145/3337821.3337839

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Abstract: Modern deep learning applications increasingly push model inference to the edge devices, for reasons such as achieving shorter latency, relieving the burden on the network connection to the cloud, and protecting user privacy. The Convolutional Neural Network (CNN) is one of the most widely used model families in these applications. Given the high computational complexity of CNN models, it is favorable to execute them on the integrated GPUs at the edge devices, which are ubiquitous and have…
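The abstract motivates running CNN inference on the integrated GPUs found in edge devices. As a minimal sketch of that setting, not code taken from the paper, the following compiles an ONNX model for an integrated GPU's OpenCL backend with Apache TVM; the model file "resnet18.onnx", the input name "data", and the input shape are placeholder assumptions.

# Hedged sketch (not from the paper): compile and run a CNN on an
# integrated GPU via TVM's OpenCL backend.
import onnx
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("resnet18.onnx")  # placeholder model file
mod, params = relay.frontend.from_onnx(
    onnx_model, shape={"data": (1, 3, 224, 224)})  # assumed input name/shape

# OpenCL is the backend integrated GPUs (Intel, ARM Mali, AMD APU) commonly expose.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="opencl", params=params)

dev = tvm.device("opencl", 0)
runtime = graph_executor.GraphModule(lib["default"](dev))
runtime.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
runtime.run()
out = runtime.get_output(0).numpy()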

Cited by 28 publications (10 citation statements) · References 26 publications

Citation statements, ordered by relevance:
“…GPUs are increasingly being used both in training and inference of the CNNs architectures due to their high-performance capability of processing vectored data, making them a perfect fit for CNNs [27,62]. Then, as our case-study scenarios involve edge devices, we looked into energy-efficient GPUs for mobile and edge computing.…”
Section: Performance Comparison - CPU vs GPU
Citation type: mentioning; confidence: 99%
“…As a result, it may be hard for a GPU SparseTrain implementation to beat the Tensor Core accelerated GEMM. Nevertheless, the method can be useful on GPUs without a hardware GEMM accelerator (e.g., the integrated GPUs used for inference on edge devices [49]), or when we desire higher precision than the one supported by the accelerator.…”
Section: Generalization to Other Hardware
Citation type: mentioning; confidence: 99%
“…Besides, constructing a quality template requires expertise in both tensor operators and hardware. It takes non-trivial research efforts [29,50,53] to develop quality templates. Despite the huge efforts in developing templates, existing manual templates only cover limited program structures because manually enumerating all optimization choices for all operators is prohibitive.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
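To make the "manual template" notion in the statement above concrete: a template fixes an operator's loop structure and exposes only a few tunable knobs (e.g., tile sizes) for an auto-tuner to enumerate; loop structures outside the template are never searched. The following is a hypothetical plain-Python sketch of the idea, not any particular system's API:

import itertools
import time
import numpy as np

def blocked_matmul(A, B, ti, tj, tk):
    """A fixed 'template' loop nest; only the tile sizes are tunable."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, ti):
        for j in range(0, N, tj):
            for k in range(0, K, tk):
                C[i:i+ti, j:j+tj] += A[i:i+ti, k:k+tk] @ B[k:k+tk, j:j+tj]
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)

# The tuner's search space is exactly the knobs the template exposes.
best = None
for ti, tj, tk in itertools.product([32, 64, 128], repeat=3):
    t0 = time.perf_counter()
    blocked_matmul(A, B, ti, tj, tk)
    dt = time.perf_counter() - t0
    if best is None or dt < best[0]:
        best = (dt, (ti, tj, tk))
print("best tile sizes:", best[1])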