Model-Driven Tile Size Selection for DOACROSS Loops on GPUs

Peng, Di; Xue, Jingling

doi:10.1007/978-3-642-23397-5_40

Cited by 12 publications

(5 citation statements)

References 19 publications

“…In both cases, the best tile sizes for tiling hyperplanes are determined empirically by using a cost model from [9]. Figure 12 shows the speedups achieved by our framework over PLUTO.…”

Section: Results (mentioning)

confidence: 99%

“…When searching for tiling hyperplanes with balanced intra-tile wavefronts and performing subsequent loop transformations, we make use of PLUTO's polyhedral implementation. We previously developed a cost model regarding tile size selection for GPUs [9]. This model estimates the execution times of a loop nest with different tile sizes and thread organizations.…”

Section: The Compiler Framework (mentioning)

confidence: 99%

Automatic Parallelization of Tiled Loop Nests with Enhanced Fine-Grained Parallelism on GPUs

Peng et al. 2012

2012 41st International Conference on Parallel Processing

Self Cite

Abstract: Automatically parallelizing loop nests into CUDA kernels must exploit the full potential of GPUs to obtain high performance. One state-of-the-art approach makes use of the polyhedral model to extract parallelism from a loop nest by applying a sequence of affine transformations to the loop nest. However, how to automate this process to exploit both intra- and inter-SM parallelism for GPUs remains a challenging problem. Presently, compilers may generate code significantly slower than hand-optimized code for certain applications. This paper describes a compiler framework for tiling and parallelizing loop nests with uniform dependences into CUDA code. We aim to improve two levels of wavefront parallelism. We find tiling hyperplanes by embedding parallelism-enhancing constraints in the polyhedral model to maximize intra-tile, i.e., intra-SM parallelism. This improves the load balance among the SPs in an SM executing a wavefront of loop iterations within a tile. We eliminate parallelism-hindering false dependences to maximize inter-tile, i.e., inter-SM parallelism. This improves the load balance among the SMs executing a wavefront of tiles. Our approach has been implemented in PLUTO and validated using eight benchmarks on two different NVIDIA GPUs (C1060 and C2050). Compared to PLUTO, our approach achieves 2-5.5X speedups across the benchmarks. Compared to highly hand-optimized 1-D Jacobi (3 points), 2-D Jacobi (5 points), 3-D Jacobi (7 points) and 3-D Jacobi (27 points), our speedups, 1.17X, 1.41X, 0.97X and 0.87X with an average of 1.10X on C1060 and 1.24X, 1.20X, 0.86X and 0.95X with an average of 1.06X on C2050, are competitive.

“…That is, they do not model the KERNEL stage with a further thorough observation. In Di and Xue (2011), an explicit execution time model with different parameters is proposed. In their model, the workload for the kernel computing is proportional to the number of warps (units of execution).…”

Section: Preliminary (mentioning)

confidence: 99%

Software pipelining for graphic processing unit acceleration: Partition, scheduling and granularity

Liu, Qiu, Jiang, et al. 2015

The International Journal of High Performance Computing Applica

The graphic processing unit (GPU) is becoming increasingly popular as a performance accelerator in various applications requiring high-performance parallel computing capability. In a central processing unit (CPU) or GPU hybrid system, software pipelining is a major task in order to deliver accelerated performance, where hiding CPU–GPU communication overheads by splitting a large task into small units is the key challenge. In this paper, we carry out a systematic investigation into task partitioning in order to achieve maximum performance gain. We first validate the advantage of even partition strategy, and then propose the optimal scheduling, with detailed study into how to achieve optimal unit size (data granularity) in an analytical framework. Experiments on AMD and NVIDIA GPU platforms demonstrate that our approaches achieve around 31 – 59% performance improvement using software pipelining.

“…Nguyen et al [13] proposed a data blocking scheme that optimizes both the memory bandwidth and computation resources on GPU devices. Peng et al [7] investigate the selection of tile sizes for GPU kernels, with an emphasis on stencil computations. However, none of these works consider fully automatic, high-performance code generation for stencil computations on GPUs.…”

Section: Related Work (mentioning)

confidence: 99%

High-performance code generation for stencil computations on GPU architectures

Holewinski, Pouchet, Sadayappan 2012

Proceedings of the 26th ACM International Conference on Supercomputing

Stencil computations arise in many scientific computing domains, and often represent time-critical portions of applications. There is significant interest in offloading these computations to high-performance devices such as GPU accelerators, but these architectures offer challenges for developers and compilers alike. Stencil computations in particular require careful attention to off-chip memory access and the balancing of work among compute units in GPU devices. In this paper, we present a code generation scheme for stencil computations on GPU accelerators, which optimizes the code by trading an increase in the computational workload for a decrease in the required global memory bandwidth. We develop compiler algorithms for automatic generation of efficient, time-tiled stencil code for GPU accelerators from a high-level description of the stencil operation. We show that the code generation scheme can achieve high performance on a range of GPU architectures, including both nVidia and AMD devices.
