2012 International Conference for High Performance Computing, Networking, Storage and Analysis 2012
DOI: 10.1109/sc.2012.107

Tiling stencil computations to maximize parallelism

Abstract: Most stencil computations allow tile-wise concurrent start, i.e., there always exists a face of the iteration space and a set of tiling hyperplanes such that all tiles along that face can be started concurrently. This provides load balance and maximizes parallelism. However, existing automatic tiling frameworks often choose hyperplanes that lead to pipelined start-up and load imbalance. We address this issue with a new tiling technique that ensures concurrent start-up as well as perfect load-balance whenever p…

Cited by 121 publications (148 citation statements)
References 26 publications
“…The next four contiguous elements Bdlt[4:7] in the transformed layout correspond to B[1], B[7], B[13], and B[19], etc. Thus the sum of aligned vectors, Bdlt[0:3] + Bdlt[4:7] + Bdlt[8:11], computes <B[0] + B[1] + B[2], B[6] + B[7] + B[8], B[12] + B[13] + B[14], B[18] + B[19] + B[20]>. Thus the fundamental problem with vectorized addition of contiguously located elements in memory is overcome in the transformed layout where operands that need to be combined are located in the same slot of different vectors rather than in different slots of the same vector.…”
Section: Problem Description (mentioning)
confidence: 99%
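The index mapping described in the excerpt above can be checked with a small sketch. It assumes a 24-element array and a vector length of 4, which is what the quoted indices imply (B[0], B[6], B[12], B[18] sharing one vector) but is not stated outright in the excerpt. The function name dlt_layout is hypothetical, and the excerpt's ranges such as Bdlt[0:3] are inclusive, whereas the Python slices below are half-open.

```python
def dlt_layout(b, vec_len):
    """Reorder b so that element i*row_len + j moves to position j*vec_len + i.

    This views b as a vec_len x row_len matrix and stores its transpose.
    vec_len = 4 and len(b) = 24 are assumptions inferred from the excerpt.
    """
    n = len(b)
    if n % vec_len != 0:
        raise ValueError("array length must be a multiple of the vector length")
    row_len = n // vec_len
    return [b[i * row_len + j] for j in range(row_len) for i in range(vec_len)]


# Use B[k] == k so that each value doubles as its original index.
B = list(range(24))
Bdlt = dlt_layout(B, 4)

# First three aligned vectors in the transformed layout.
v0, v1, v2 = Bdlt[0:4], Bdlt[4:8], Bdlt[8:12]

# Slot-wise sum of the three vectors: each slot holds B[k] + B[k+1] + B[k+2].
sums = [x + y + z for x, y, z in zip(v0, v1, v2)]

print(v0)    # original indices held by the first vector
print(v1)    # original indices held by the second vector
print(sums)  # three-point sums at k = 0, 6, 12, 18
```

In this sketch the three neighbouring operands of each sum land in the same slot of consecutive vectors, so a plain element-wise vector add is enough and no shuffles within a vector are needed.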
“…We compare performance to the diamond-tiling system used by Pluto [2], the cache-oblivious tiling system used by Pochoir [18], and the Intel C Compiler v13.0.…”
Section: Experimental Evaluation (mentioning)
confidence: 99%