The Tensor Core Unit (TCU) has been increasingly adopted in modern high-performance processors and is specialized for boosting the performance of general matrix multiplication (GEMM). Owing to its highly optimized hardware design, the TCU can significantly accelerate GEMM-based operations widely used in scientific and deep learning applications. However, little work has exploited the TCU to accelerate non-GEMM operations such as stencil computation, which is also important in high-performance computing. To the best of our knowledge, no previous work adapts stencil computation to the TCU efficiently while accounting for its unique characteristics. In this paper, we propose a new method, TCstencil, that adapts the TCU to accelerate stencil computation. Specifically, we re-design stencil computation as a series of reduction and summation operations in order to leverage the computing power of the TCU. In addition, we propose corresponding optimizations to better exploit the TCU and the memory hierarchy on GPUs. We evaluate our method with different stencils and input mesh sizes on NVIDIA A100 and V100 GPUs. The experimental results demonstrate that our method achieves superior performance compared to state-of-the-art stencil optimization frameworks.
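The abstract does not give the exact formulation, so as a minimal sketch of the general idea: assuming a 1D 3-point stencil, one sweep can be cast as a product with a banded coefficient matrix, which is the GEMM shape a TCU consumes. This only illustrates why such a mapping is possible; it is not TCstencil's actual decomposition.

```python
import numpy as np

# Illustration only: casting a 1D 3-point stencil
# u'[i] = a*u[i-1] + b*u[i] + c*u[i+1]
# as a matrix multiplication, the general trick that lets tensor
# cores (which only do GEMM) apply a stencil. This is a sketch,
# NOT TCstencil's actual kernel.

n = 8
a, b, c = 0.25, 0.5, 0.25            # stencil coefficients
u = np.random.rand(n).astype(np.float32)

# Banded coefficient matrix A: row i holds (a, b, c) around column i.
A = np.zeros((n, n), dtype=np.float32)
for i in range(n):
    if i > 0:
        A[i, i - 1] = a
    A[i, i] = b
    if i < n - 1:
        A[i, i + 1] = c

# One stencil sweep becomes a GEMM-shaped product.
u_matmul = A @ u

# Reference: direct stencil loop with zero boundaries.
u_loop = np.zeros_like(u)
for i in range(n):
    left = u[i - 1] if i > 0 else 0.0
    right = u[i + 1] if i < n - 1 else 0.0
    u_loop[i] = a * left + b * u[i] + c * right

assert np.allclose(u_matmul, u_loop)
```

In practice the coefficient matrix would be tiled into the small fixed-size fragments that tensor cores operate on (e.g., 16x16 on NVIDIA hardware) and the per-tile products accumulated, which is presumably where the abstract's "reduction and summation" structure enters; the exact scheme is the paper's contribution.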
The proliferation of deep learning frameworks and hardware platforms demands an efficient compiler that can hide the diversity of both software and hardware and thereby provide application portability. Among existing deep learning compilers, TVM is well known for the efficiency of its code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor is a competitive candidate thanks to its attractive computational power for both scientific and deep learning applications. This paper combines these two trends. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time compilation for architectures that require cross-compilation, such as Sunway. In addition, we leverage architectural features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfers, and local device memory for data locality, in order to generate efficient code for deep learning applications on Sunway. The experimental results show that swTVM can automatically generate code for various deep neural network models on Sunway. The code automatically generated by swTVM for AlexNet and VGG-19 achieves 6.71x and 2.45x average speedups over hand-optimized OpenACC implementations of the convolution and fully connected layers, respectively. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and this high-performance architecture with both productivity and efficiency in mind. We plan to open-source the implementation so that more people can embrace the power of deep learning compilers and the Sunway many-core processor.
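Since the abstract centers on extending TVM's compilation flow, a brief sketch of that flow may help. The snippet below uses the standard Relay API from mainline TVM contemporaneous with this work; the commented-out Sunway target at the end is a hypothetical placeholder, since the abstract does not expose swTVM's actual target name or interface.

```python
# Sketch of the stock TVM (Relay) flow that an ahead-of-time Sunway
# backend like swTVM would extend. The API calls are standard TVM;
# the Sunway target below is a hypothetical placeholder.
import tvm
from tvm import relay

# A toy one-layer network expressed in Relay IR.
data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(64, 3, 3, 3), dtype="float32")
out = relay.nn.relu(relay.nn.conv2d(data, weight, padding=(1, 1)))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

# Standard build for the host CPU via LLVM (real TVM usage).
with tvm.transform.PassContext(opt_level=3):
    host_lib = relay.build(mod, target="llvm")

# Cross-compilation in mainline TVM is driven by an LLVM triple,
# e.g. target="llvm -mtriple=aarch64-linux-gnu". An architecture
# not targeted by upstream LLVM, like Sunway, instead needs
# ahead-of-time source generation -- the gap swTVM fills.
# Hypothetical illustration only:
# with tvm.transform.PassContext(opt_level=3):
#     sw_lib = relay.build(mod, target="sw_aot")  # placeholder name
```

The design point worth noting is that on self-hosted targets TVM can JIT-compile and tune on the device itself, whereas a cross-compiled target like Sunway forces the entire lowering-to-deployable-code path to happen ahead of time on the host.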