2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
DOI: 10.1109/cgo53902.2022.9741270
Automatic Horizontal Fusion for GPU Kernels


Cited by 19 publications (5 citation statements). References 0 publications.
“…Automatic fusion of GPU kernels is a known optimization technique used for accelerating many scientific [12-14] and deep learning [15, 16] applications, and is always bound to some compiler technology. Aggregating knowledge from other studies [17, 18], we can formulate three distinct reasons for GPU kernel fusion: (1) to achieve better instruction latency hiding by fusing two data-independent kernels that require different kinds of GPU resources; (2) to eliminate intermediate data round trips by fusing neighboring data-dependent kernels; (3) to reduce energy consumption and thus improve GPU power efficiency. It is worth pointing out that reason (2) is the most common because many GPU kernels are memory-bound and data-dependent.…”
Section: Related Work
confidence: 99%
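As an aside, reason (1) in the statement above is the idea behind horizontal fusion. The following is a minimal CUDA sketch of that general idea only: two data-independent kernels with different instruction mixes are merged into one kernel so their work is resident on the GPU at the same time. The kernel bodies, names, and the block-level partitioning are illustrative assumptions, not the specific fusion scheme used by the paper.

#include <cuda_runtime.h>

// Original kernel A: memory-bound elementwise scaling.
__global__ void kernelA(const float* x, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = 2.0f * x[i];
}

// Original kernel B: compute-bound, many FMAs per element.
__global__ void kernelB(const float* x, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = x[i];
    for (int k = 0; k < 256; ++k) v = v * 1.0001f + 0.5f;
    y[i] = v;
  }
}

// Horizontally fused kernel: the first blocksA thread blocks execute A's
// body, the remaining blocks execute B's body, so the memory-bound and
// compute-bound instruction mixes overlap instead of running back to back.
__global__ void fusedAB(const float* xa, float* ya, int na,
                        const float* xb, float* yb, int nb, int blocksA) {
  if (blockIdx.x < blocksA) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < na) ya[i] = 2.0f * xa[i];
  } else {
    int i = (blockIdx.x - blocksA) * blockDim.x + threadIdx.x;
    if (i < nb) {
      float v = xb[i];
      for (int k = 0; k < 256; ++k) v = v * 1.0001f + 0.5f;
      yb[i] = v;
    }
  }
}

// Launch sketch: one grid covers both workloads.
//   int threads = 256;
//   int blocksA = (na + threads - 1) / threads;
//   int blocksB = (nb + threads - 1) / threads;
//   fusedAB<<<blocksA + blocksB, threads>>>(xa, ya, na, xb, yb, nb, blocksA);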
“…This typically leads to many possible combinations. The algorithms for finding the best substitution graph differ and can be based on rules [15, 16], empirical searches [13], exhaustive searches coupled with automatic benchmarking [17] and performance models for pruning search spaces [12], or dynamic programming [18]. Due to the specifics of the ADER-DG method, we use a greedy approach and try to fuse the longest sequence of batched GEMM kernels extracted from streams of YATeTo's instructions using a simple finite automaton (see Section 5.3).…”
Section: Related Work
confidence: 99%
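The greedy strategy described in the statement above can be pictured with a small host-side sketch (plain C++, compilable as part of a CUDA source file): a two-state automaton scans an instruction stream and marks each maximal contiguous run of batched GEMM instructions as a fusion candidate. The Instr and Kind types and the example stream are hypothetical placeholders, not YATeTo's actual IR.

#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical instruction kinds; a real IR would carry operands and shapes.
enum class Kind { BatchedGemm, Other };

struct Instr {
  Kind kind;
  std::string name;
};

// Two-state automaton (OUTSIDE / INSIDE a run): returns the [begin, end)
// ranges of maximal contiguous runs of batched GEMMs; each range is a
// candidate for fusion into one kernel, everything else is left untouched.
std::vector<std::pair<size_t, size_t>>
findFusibleRuns(const std::vector<Instr>& stream) {
  std::vector<std::pair<size_t, size_t>> runs;
  enum { OUTSIDE, INSIDE } state = OUTSIDE;
  size_t begin = 0;
  for (size_t i = 0; i < stream.size(); ++i) {
    bool gemm = (stream[i].kind == Kind::BatchedGemm);
    if (state == OUTSIDE && gemm) { state = INSIDE; begin = i; }
    else if (state == INSIDE && !gemm) { state = OUTSIDE; runs.emplace_back(begin, i); }
  }
  if (state == INSIDE) runs.emplace_back(begin, stream.size());
  return runs;
}

int main() {
  std::vector<Instr> stream = {{Kind::BatchedGemm, "gemm0"},
                               {Kind::BatchedGemm, "gemm1"},
                               {Kind::Other, "reduction"},
                               {Kind::BatchedGemm, "gemm2"}};
  for (const auto& r : findFusibleRuns(stream))
    std::printf("fuse instructions [%zu, %zu)\n", r.first, r.second);
  return 0;
}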
“…ALT addresses the two limitations via 1) the generic layout transformation submodule, which requires no re-implementation and is independent of the loop transformation, achieving the decoupling; and 2) an autotuning module at a higher level that orchestrates the cross-layer joint tuning while guaranteeing efficiency. Recent loop optimization techniques [2, 3, 5, 21, 42, 65, 66, 73, 78, 80, 85, 89-91], such as delicate cost models [3, 5, 42, 73], aggressive operator fusion [21, 40, 46, 50, 80, 90], and micro-kernel construction [91], are complementary to ALT.…”
Section: Related Work
confidence: 99%
“…Kernel fusion, referred to as operator fusion in the context of neural networks, has become a common technique for improving the performance of neural networks [27] and linear algebra [11]. Despite extensive research on this topic [17, 4], the effectiveness of kernel fusion highly depends on various prior and subsequent optimizations, which are the focus of our study.…”
Section: Introduction
confidence: 99%