2011 International Conference on Parallel Processing
DOI: 10.1109/icpp.2011.47

Cache Accurate Time Skewing in Iterative Stencil Computations

Abstract: We present a time skewing algorithm that breaks the memory wall for certain iterative stencil computations. A stencil computation, even with constant weights, is a completely memory-bound algorithm. For example, for a large 3D domain of 500³ doubles and 100 iterations on a quad-core Xeon X5482 3.2GHz system, a hand-vectorized and parallelized naive 7-point stencil implementation achieves only 1.4 GFLOPS because the system memory bandwidth limits the performance. Although many efforts have been undert…
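For readers unfamiliar with the term, the following is a minimal sketch of a naive constant-weight 7-point stencil iteration of the kind the abstract refers to. It is not the paper's hand-vectorized implementation. The function names (`idx`, `stencil7_sweep`, `iterate`), the weights 0.25 and 0.125, the flat row-major layout, and the fixed boundary values are all assumptions made for illustration.

```python
def idx(x, y, z, n):
    """Flat row-major index into an n*n*n grid (assumed layout)."""
    return (x * n + y) * n + z


def stencil7_sweep(src, dst, n, w_center, w_neighbor):
    """One naive sweep: each interior cell is a weighted sum of itself
    and its six axis neighbors. Boundary cells are left untouched."""
    for x in range(1, n - 1):
        for y in range(1, n - 1):
            for z in range(1, n - 1):
                c = idx(x, y, z, n)
                neighbors = (src[c - 1] + src[c + 1]            # z neighbors
                             + src[c - n] + src[c + n]          # y neighbors
                             + src[c - n * n] + src[c + n * n])  # x neighbors
                dst[c] = w_center * src[c] + w_neighbor * neighbors


def iterate(grid, n, steps, w_center=0.25, w_neighbor=0.125):
    """Run `steps` sweeps with two ping-pong buffers and return the result.
    The weights are illustrative; they are not taken from the paper."""
    src, dst = list(grid), list(grid)
    for _ in range(steps):
        stencil7_sweep(src, dst, n, w_center, w_neighbor)
        src, dst = dst, src
    return src
```

In this formulation each cell update performs eight floating-point operations while reading seven values and writing one. Once the grid no longer fits in cache, every sweep streams the whole domain through main memory, which is the bandwidth limit the abstract describes.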

Cited by 63 publications (59 citation statements)
References 25 publications
“…Figure 2a shows three perspectives of our space-time block diamond tile. It is similar to the work of (Strzodka et al 2011), except we use multi-threaded wavefronts instead of a single-thread wavefront. We utilize an auto-tuning strategy to select the most performance-efficient diamond tile size.…”
Section: Methodology: Multi-threaded Wavefront Diamond Blocking (mentioning)
confidence: 86%
“…Most of previous literature on stencil optimization work only on very low-order (=2) stencils [Datta et al, 2008, Datta, 2009, Nguyen et al, 2010, Henretty et al, 2011, Strzodka et al, 2011, Zumbusch, 2012, Wonnacott and Strout, 2012]. These strategies may not work effectively for high-order (=16 or more) stencils, or even the low-order (=4) stencils.…”
Section: Literature Review (mentioning)
confidence: 99%
“…The CATS algorithm [Strzodka et al, 2011] (time-skewing with diamond spatial block scheme) demonstrated that on a quad-core 3.2GHz Xeon X5482, the GFLOPs/sec dropped by 2.8x when the order of a 3D stencil increased from 2 to 6. Zumbusch [Zumbusch, 2013] also observed the performance dropped by 3x on a single-core Sandy Bridge Intel i7-2600 when the order of a 3D stencil increased from 2 to 12.…”
Section: Literature Review (mentioning)
confidence: 99%
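To illustrate the general idea of time skewing that these statements refer to, here is a minimal 1D sketch. It is not the CATS scheme from the cited paper. It uses plain parallelogram tiles on a 3-point stencil, and the names (`step_range`, `naive`, `time_skewed`), the weights, and the tile size are assumptions chosen for illustration. Each tile advances through all time steps before the next tile starts, shifting one cell to the left per step so that every value it reads has already been computed.

```python
def step_range(src, dst, lo, hi):
    """Apply a 3-point stencil to cells lo..hi-1 (illustrative weights)."""
    for x in range(lo, hi):
        dst[x] = 0.25 * src[x - 1] + 0.5 * src[x] + 0.25 * src[x + 1]


def naive(u0, steps):
    """Reference: sweep the whole domain once per time step."""
    n = len(u0)
    bufs = [list(u0), list(u0)]  # ping-pong buffers; cells 0 and n-1 stay fixed
    for t in range(1, steps + 1):
        step_range(bufs[(t - 1) % 2], bufs[t % 2], 1, n - 1)
    return bufs[steps % 2]


def time_skewed(u0, steps, tile):
    """Time-skewed traversal: tiles run left to right, and within a tile
    all time steps are executed before moving to the next tile."""
    n = len(u0)
    bufs = [list(u0), list(u0)]
    # Tile origins extend past the domain so the skewed tiles still
    # cover the rightmost cells at the last time step.
    for x0 in range(1, n - 1 + steps, tile):
        x1 = x0 + tile
        for t in range(1, steps + 1):
            lo = max(1, x0 - t)       # skew left by one cell per step
            hi = min(n - 1, x1 - t)   # clip to the interior
            if lo < hi:
                step_range(bufs[(t - 1) % 2], bufs[t % 2], lo, hi)
    return bufs[steps % 2]
```

Because both traversals perform the same arithmetic on the same inputs, `time_skewed(u0, steps, tile)` should match `naive(u0, steps)` exactly. The difference is only the order of updates, which lets a tile's working set be reused across time steps while it is still in cache.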
“…Henretty et al [11] use a DSL-based approach relying on different versions of split tiling in combination with a data layout transformation to generate efficient SIMD CPU code. Strzodka [22] uses an in-tile wavefront traversal technique to achieve efficient cache use even with tile sizes larger than the available cache memory. All these approaches generate efficient CPU code.…”
Section: Related Work (mentioning)
confidence: 99%