2018
DOI: 10.1002/cpe.4470
Accelerating explicit ODE methods on GPUs by kernel fusion

Abstract: Graphics processing units (GPUs) have a promising architecture for implementing highly parallel solution methods for systems of ordinary differential equations (ODEs). However, their high performance comes at the price of caveats such as small caches or wide SIMD. For ODE methods, optimizing the memory access pattern is often crucial. In this article, instead of considering only one specific method, we generalize the description of explicit ODE methods by using data flow graphs consisting of basic operations t…

Cited by 8 publications (6 citation statements)
References 32 publications
“…The application of kernel fusion to ODE methods on GPUs for general ODE systems was also considered 2 . For those systems it is only allowed to fuse RHS → LC, RHS → RED, LC → LC and LC → RED dependencies, while a global barrier is required for each LC → RHS dependency.…”
Section: Related Work
confidence: 99%
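As a hedged illustration of the dependency taxonomy quoted above (RHS = right-hand-side evaluation, LC = linear combination, RED = reduction), the following NumPy sketch shows why an RHS → LC dependency is fusible while LC → RHS forces a global barrier. The example ODE, function names, and step code are hypothetical, not taken from the cited framework.

```python
import numpy as np

# Hypothetical example system y' = -y; rhs_i evaluates one component of
# the right-hand side. In general rhs_i may read ALL of y, which is why
# a step's result must be fully written (global barrier on a GPU) before
# the next RHS evaluation can start.
def rhs_i(t, y, i):
    return -y[i]

def euler_step_unfused(t, y, h):
    # Two "kernels": RHS materializes F in memory, LC reads it back.
    F = np.array([rhs_i(t, y, i) for i in range(y.size)])  # RHS kernel
    return y + h * F                                       # LC kernel

def euler_step_fused(t, y, h):
    # RHS -> LC fused: each component's f_i is consumed immediately,
    # so the intermediate array F never round-trips through memory.
    out = np.empty_like(y)
    for i in range(y.size):
        f_i = rhs_i(t, y, i)      # RHS for component i
        out[i] = y[i] + h * f_i   # LC uses f_i right away
    return out
    # LC -> RHS cannot be fused the same way: the NEXT step's rhs_i may
    # read any component of `out`, so all of `out` must exist first.
```

Both variants compute the same Euler step; the fused form merely removes the intermediate array, mirroring how fusing RHS → LC removes a kernel boundary and its off-chip traffic.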
“…We have added the ability to generate multi‐workgroup tilings for explicit one‐step methods along a user defined dependency chain to our automatic prototype framework 2,3 . This framework allows a user to solve an arbitrary IVP by an arbitrary explicit ODE method of several supported classes (RK methods, PIRK methods, peer methods, Adams–Bashforth methods).…”
Section: Experimental Evaluation
confidence: 99%
“…Another known approach is kernel fusion, a technique to fuse multiple memory-intensive ops with data dependencies into a single kernel to reduce off-chip memory accesses. Prior works have explored this idea extensively in database [32], image processing [5,16,23], HPC applications [18,30], and AI workloads [7,19]. However, there are two notable limitations when targeting memory-intensive DL models.…”
Section: Introduction
confidence: 99%
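A minimal sketch of the general technique this statement describes (fusing dependent memory-intensive elementwise operations into one pass to reduce off-chip traffic); the function names are illustrative and assume nothing about the cited systems:

```python
import numpy as np

def axpb_unfused(x, a, b):
    # Two passes over memory: `tmp` is written out and read back,
    # doubling the off-chip traffic for this dependency chain.
    tmp = a * x       # op 1
    return tmp + b    # op 2 (depends on op 1)

def axpb_fused(x, a, b):
    # One pass: each element is loaded once and both ops are applied
    # before the result is stored; the intermediate value stays in
    # registers instead of a temporary array.
    out = np.empty_like(x)
    for i in range(x.size):
        out[i] = a * x[i] + b
    return out
```

On a GPU the same transformation merges two kernel launches into one, which is the source of the bandwidth savings the citing papers target.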