2011
DOI: 10.1007/978-3-642-23397-5_40

Model-Driven Tile Size Selection for DOACROSS Loops on GPUs

Abstract: DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. In contrast, due to loop-carried dependences, DOACROSS loops must be skewed first in order to make tiling legal and exploit wavefront parallelism across the tiles and within a tile. Thus, tile size selection, which is performance-critical, becomes more complex for DOACROSS loops than DOALL loops on GPUs. This paper presents a model-driven approach to automating this process. Validation using 1D, 2D and 3D SOR solvers shows …
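
To illustrate the wavefront parallelism the abstract refers to, here is a minimal Python sketch. It is not the paper's method or code: the function names `sor_sequential` and `sor_wavefront`, the relaxation factor `omega`, and the single in-place 2D SOR sweep are all assumptions chosen for illustration. The sketch only shows that points on the same anti-diagonal (`i + j` constant) of such a sweep do not depend on each other, so they can be updated in any order, or in parallel, without changing the result. It does not cover loop skewing across time steps, tiling, or tile size selection.

```python
def sor_sequential(grid, omega=1.5):
    """One in-place SOR sweep in ordinary row-major order (hypothetical helper)."""
    a = [row[:] for row in grid]
    n, m = len(a), len(a[0])
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            a[i][j] = (1 - omega) * a[i][j] + omega * 0.25 * (
                a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1]
            )
    return a


def sor_wavefront(grid, omega=1.5):
    """The same sweep, visiting interior points one anti-diagonal at a time.

    Point (i, j) needs the already-updated values at (i-1, j) and (i, j-1),
    which lie on wavefront d-1, and the not-yet-updated values at (i+1, j)
    and (i, j+1), which lie on wavefront d+1. Points sharing a wavefront d
    are therefore independent of each other.
    """
    a = [row[:] for row in grid]
    n, m = len(a), len(a[0])
    for d in range(2, (n - 2) + (m - 2) + 1):  # d = i + j over interior points
        i_lo = max(1, d - (m - 2))
        i_hi = min(n - 2, d - 1)
        # Deliberately iterate in reverse to show the order within a
        # wavefront does not matter; on a GPU these would be parallel threads.
        for i in range(i_hi, i_lo - 1, -1):
            j = d - i
            a[i][j] = (1 - omega) * a[i][j] + omega * 0.25 * (
                a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1]
            )
    return a
```

Both functions apply the same arithmetic to the same inputs at every point, so they should return identical grids; only the visiting order differs.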

Citations: cited by 12 publications (5 citation statements)
References: 19 publications
“…In both cases, the best tile sizes for tiling hyperplanes are determined empirically by using a cost model from [9]. Figure 12 shows the speedups achieved by our framework over PLUTO.…”
Section: Results (mentioning)
confidence: 99%
“…When searching for tiling hyperplanes with balanced intra-tile wavefronts and performing subsequent loop transformations, we make use of PLUTO's polyhedral implementation. We previously developed a cost model regarding tile size selection for GPUs [9]. This model estimates the execution times of a loop nest with different tile sizes and thread organizations.…”
Section: The Compiler Framework (mentioning)
confidence: 99%
“…That is, they do not model the KERNEL stage with a further thorough observation. In Di and Xue (2011), an explicit execution time model with different parameters is proposed. In their model, the workload for the kernel computing is proportional to the number of warps (units of execution).…”
Section: Preliminary (mentioning)
confidence: 99%
“…Nguyen et al [13] proposed a data blocking scheme that optimizes both the memory bandwidth and computation resources on GPU devices. Peng et al [7] investigate the selection of tile sizes for GPU kernels, with an emphasis on stencil computations. However, none of these works consider fully automatic, high-performance code generation for stencil computations on GPUs.…”
Section: Related Work (mentioning)
confidence: 99%