A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations

Meng, Jiayuan; Skadron, Kevin

doi:10.1007/s10766-010-0142-5

Cited by 41 publications

(33 citation statements)

References 33 publications


“…Recent work has shown promise for high performance by use of overlapped tiling on GPUs [11] for stencil computations. In this paper, we present an automated approach to generate efficient overlapped tiling code for stencil computations on GPUs.…”

Section: Stencil Computations (mentioning)

confidence: 99%

“…Our equivalent Jacobi 2-D stencil achieves 49.5 GFlop/s in double-precision mode on the GTX 580. Meng et al. [11] report approximately 2 × 10^6 cycles per iteration on a GTX 280 for a Poisson stencil that has been manually tiled using overlapped tiling with a time tile size of 3. With a clock speed of 1.3 GHz, this gives approximately 70.2 GFlop/s.…”

Section: Performance Analysis (mentioning)

confidence: 99%

“…A number of recent studies have focused on optimizing stencil computations on multicore CPUs [2,6,8,17,19] as well as GPUs [11][12][13].…”

Section: Introduction (mentioning)

confidence: 99%

“…The standard approach to time-tiling of stencil computations requires loop skewing to make tiling legal and this results in loss of inter-tile concurrency [10], since inter-tile dependences are introduced in the spatial directions due to the skewing. The approach of "overlapped tiling" [10], also called "ghost zone" optimization [3,11], has been used for preserving concurrency in parallel time-tiled execution of stencil computations. However, we are unaware of any fully automated compiler approach for the generation of overlapped-tiling code for execution on GPUs.…”

Section: Introduction (mentioning)

confidence: 99%
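
The "overlapped tiling" or "ghost zone" idea in the statement above can be illustrated with a small CPU sketch. This is not the implementation from Meng and Skadron or from the citing paper: the function names (`jacobi_step`, `reference`, `overlapped_tiled`), the 3-point 1-D Jacobi stencil, and the fixed end points are all assumptions chosen for illustration. Each tile copies its own points plus a halo of `steps` extra points on each side, runs all time steps locally without talking to its neighbours, and writes back only its own points.

```python
def jacobi_step(u):
    """One 3-point Jacobi sweep; the two end points are held fixed."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def reference(u, steps):
    """Untiled baseline: sweep the whole array once per time step."""
    for _ in range(steps):
        u = jacobi_step(u)
    return u


def overlapped_tiled(u, steps, tile):
    """Time-tiled version with ghost zones (hypothetical sketch).

    Every tile is independent, so on a GPU each one could map to a
    thread block with `local` held in on-chip memory.
    """
    n = len(u)
    out = list(u)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        # Halo of `steps` points per side (stencil radius 1), clipped to the domain.
        lo = max(0, start - steps)
        hi = min(n, end + steps)
        local = u[lo:hi]
        # Halo points next to an interior cut go stale by one point per step,
        # so after `steps` sweeps the staleness has not reached [start, end).
        for _ in range(steps):
            local = jacobi_step(local)
        out[start:end] = local[start - lo:end - lo]
    return out
```

Because neighbouring tiles recompute each other's halo points, no tile waits on another between time steps. The cost is the redundant halo work, which grows with the number of fused steps.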


High-performance code generation for stencil computations on GPU architectures

Holewinski

Pouchet

Sadayappan

2012

Proceedings of the 26th ACM International Conference on Supercomputing

214

156


Stencil computations arise in many scientific computing domains, and often represent time-critical portions of applications. There is significant interest in offloading these computations to high-performance devices such as GPU accelerators, but these architectures offer challenges for developers and compilers alike. Stencil computations in particular require careful attention to off-chip memory access and the balancing of work among compute units in GPU devices. In this paper, we present a code generation scheme for stencil computations on GPU accelerators, which optimizes the code by trading an increase in the computational workload for a decrease in the required global memory bandwidth. We develop compiler algorithms for automatic generation of efficient, time-tiled stencil code for GPU accelerators from a high-level description of the stencil operation. We show that the code generation scheme can achieve high performance on a range of GPU architectures, including both nVidia and AMD devices.
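
The trade of extra computation for less global memory traffic mentioned in this abstract can be made concrete with a simple 1-D cost model. This is an illustrative sketch, not a formula taken from the paper. The symbols W (output points per tile), r (stencil radius), and T (time steps fused per tile) are assumptions introduced here.

```latex
\begin{align*}
\text{loads per tile, time-tiled} &= W + 2rT \\
\text{loads per tile, } T \text{ separate sweeps} &= T\,(W + 2r) \\
\text{updates per tile, time-tiled} &= \sum_{t=1}^{T} \bigl(W + 2r(T - t)\bigr) = WT + rT(T-1) \\
\text{compute overhead} &= \frac{WT + rT(T-1)}{WT} = 1 + \frac{r\,(T-1)}{W}
\end{align*}
```

For example, with W = 256, r = 1, and T = 3, this model gives 262 loads per tile instead of 774, at a compute overhead of 1 + 2/256, or roughly 0.8% extra updates. In two or three dimensions the halo scales with the tile surface, so the overhead grows faster than in this 1-D case.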


“…In the literature, the problem of designing efficient implementations for this class of algorithms has been addressed for both CPUs ([5], [6]) and GPGPUs ([7], [8]): on such architectures, the main problems that have been faced are the memory organization and the data transfers.…”

Section: State-of-the-art Implementations (mentioning)

confidence: 99%

A high-level synthesis flow for the implementation of iterative stencil loop algorithms on FPGA devices

Nacci

Rana

Bruschi

et al.

2013

Proceedings of the 50th Annual Design Automation Conference


The automatic generation of hardware implementations for a given algorithm is generally a difficult task, especially when data dependencies span across multiple iterations such as in iterative stencil loops (ISLs). In this paper, we introduce an automatic design flow to extract parallelism from an ISL algorithm and perform a design space exploration to identify its best FPGA hardware implementation, in terms of both area and throughput. Experimental results show that the proposed methodology generates hardware designs whose performance is comparable to that of manually optimized solutions, and orders of magnitude higher than the implementations generated by commercial high-level synthesis tools.


TOAST: Automatic tiling for iterative stencil computations on GPUs

Rocha

Pereira

Ramos

et al.

2017

Concurrency and Computation


Summary: The stencil pattern is important in many scientific and engineering domains, spurring great interest from researchers and industry. In recent years, various optimizations have been proposed for parallel stencil applications running on graphics processing units (GPUs). In particular, tiling is a technique that can significantly enhance application performance by improving data locality and by reducing the volume of communication between host memory and GPU. In addition, tiling enables stencil applications to process inputs that are larger than the physical GPU memory. However, implementing tiling efficiently is complex, time‐consuming, and error‐prone. In this paper, we propose transparently optimized automatic stencil tiling (TOAST), an automatic tiling mechanism for iterative stencil computations running on GPUs. TOAST has 3 main benefits: (1) it incorporates an optimization model that seeks to maximize data reuse within tiles while respecting the amount of dynamically available GPU memory; (2) it offers a virtualized GPU memory for stencil computations, allowing for large input data; and (3) it performs optimal tiling transparently to the developer of the parallel stencil application. The current implementation of TOAST augments the PSkel framework with an internal solver based on genetic algorithms. Our experimental results show that TOAST improves the performance of iterative stencil applications by up to 13× compared with their multithreaded (central processing unit–based) optimized versions and up to 48× compared with a naive tiling approach on GPU. The TOAST mechanism is able to automatically achieve a low percentage overhead of data management compared with actual stencil computation.
