We study five popular auto-parallelization frameworks (Cetus, Par4all, Rose, ICC, and Pluto) and compare them qualitatively as well as quantitatively. All the frameworks primarily target loop parallelization but differ in the techniques they use to identify parallelization opportunities. Owing to this variation, certain features, such as specific loop transformations, are supported by only a few of the frameworks. The frameworks exhibit varying abilities to handle loop-carried dependences and therefore achieve different speedups on the widely used PolyBench and NAS parallel benchmarks. In particular, the Intel C Compiler (ICC) emerges as a good overall parallelizer. Our study also highlights the need for more sophisticated analyses, user-driven parallelization, and a meta-auto-parallelizer that combines the benefits of the various frameworks.

KEYWORDS

loop-carried dependence, loop transformations, privatization, vectorization
INTRODUCTION

The quest for performance has made multicore processors mainstream. It has become vital for programmers and algorithm developers to exploit these ubiquitous architectures through parallel programming. However, exploiting the potential of multicore processors in this way is a significant challenge. Among the several approaches to tackling this challenge, a promising and programmer-friendly one is automatic parallelization.1-3 Auto-parallelizers eliminate the need for a programmer to manually transform sequential code into parallel code, which is quite attractive.

Auto-parallelizers such as Cetus,4-7 Par4all,8,9 Pluto,10,11 Parallware,12,13 Rose,14,15 Intel C Compiler,16 LLVM (Low Level Virtual Machine) Polly,17 ParaWise,18 ParaGraph,19 SUIF (Stanford University Intermediate Format),20-22 and Polaris23,24 perform source-to-source transformations with insertion of parallel directives. Although these existing parallelizers offer considerable benefits, they still fall short of fully replacing manual transformation. Some parallelizers do not utilize all of the statically available information, whereas others lack modeling precision. This leads either to missed parallelization opportunities or to unwanted parallelization of sequential code. As a consequence, the auto-parallelized code may incur parallelization overheads and exhibit poorer performance than the original sequential version.25

Earlier studies26,27 of parallelizing compilers using the Perfect benchmarks performed a detailed analysis of code restructuring techniques. The techniques include induction variable elimination, scalar expansion, forward substitution, strip mining, and loop interchange. The studies found that some of the programs showed improvements and that scalar expansion led to positive results. An effectiveness study of the Polaris compiler with OpenMP (Open Multi-Processing)28 parallel code using the Perfect benchmarks reported a performance lag in small parallel loops. It also illustrated the importance of the reduction operation, which resulted in a moderate (10%) performance...