Cache Accurate Time Skewing in Iterative Stencil Computations

Strzodka, Robert; Shaheen, Mohammed; Pająk, Dawid; Seidel, Hans‐Peter

doi:10.1109/icpp.2011.47

Cited by 63 publications

(59 citation statements)

References 25 publications

“…Figure 2a shows three perspectives of our space-time block diamond tile. It is similar to the work of (Strzodka et al 2011), except we use multi-threaded wavefronts instead of a single-thread wavefront. We utilize an auto-tuning strategy to select the most performance-efficient diamond tile size.…”

Section: Methodology: Multi-threaded Wavefront Diamond Blocking (mentioning)

confidence: 86%

Towards Fast Reverse Time Migration Kernels using Multi-threaded Wavefront Diamond Tiling

Malas, Hager, Ltaief et al. 2015

Second EAGE Workshop on High Performance Computing for Upstream

“…Most of previous literature on stencil optimization work only on very low-order (=2) stencils [Datta et al, 2008, Datta, 2009, Nguyen et al, 2010, Henretty et al, 2011, Strzodka et al, 2011, Zumbusch, 2012, Wonnacott and Strout, 2012]. These strategies may not work effectively for high-order (=16 or more) stencils, or even the low-order (=4) stencils.…”

Section: Literature Review (mentioning)

confidence: 99%

“…The CATS algorithm [Strzodka et al, 2011] (time-skewing with diamond spatial block scheme) demonstrated that on a quad-core 3.2GHz Xeon X5482, the GFLOPs/sec dropped by 2.8x when the order of a 3D stencil increased from 2 to 6. Zumbusch [Zumbusch, 2013] also observed the performance dropped by 3x on a single-core Sandy Bridge Intel i7-2600 when the order of a 3D stencil increased from 2 to 12.…”

Section: Literature Review (mentioning)

confidence: 99%

“…Most of the literatures on stencil vectorization use SIMD intrinsics [Datta et al, 2008, Datta, 2009, Dursun et al, 2009, Henretty et al, 2011, Strzodka et al, 2011, Dursun et al, 2012, Zumbusch, 2012, Zumbusch, 2013], and [Datta et al, 2008, Dursun et al, 2012] explicitly claimed that their compilers had failed to auto-vectorize the stencil codes. Borges [Borges, 2011] gives an example of auto-vectorizing an 8th order stencil kernel, however their procedures only work for stencils of a fixed order because all the finite difference terms are explicitly written in their scheme, so whenever the stencil order changes, the codes need to be rewritten.…”

Section: Literature Review (mentioning)

confidence: 99%

Wave equation based stencil optimizations on multi-core CPU

Zhou, Symes 2014

SEG Technical Program Expanded Abstracts 2014

Wave propagation stencil kernels are engines of seismic imaging algorithms. These kernels are both compute- and memory-intensive. This work targets improving the performance of wave equation based stencil code parallelized by OpenMP on a multi-core CPU. To achieve this goal, we explored two techniques: improving vectorization by using hardware SIMD technology, and reducing memory traffic to mitigate the bottleneck caused by limited memory bandwidth. We show that with loop interchange, memory alignment, and compiler hints, both icc and gcc compilers can provide fully-vectorized stencil code of any order with performance comparable to that of SIMD intrinsic code. To reduce cache misses, we present three methods in the context of OpenMP parallelization: rearranging loop structure, blocking thread accesses, and temporal loop blocking. Our results demonstrate that fully-vectorized high-order stencil code will be about 2X faster if implemented with either of the first two methods, and fully-vectorized low-order stencil code will be about 1.2X faster if implemented with the combination of the last two methods. Our final best-performing code achieves 20% to 30% of peak GFLOPs/sec, depending on stencil order and compiler.

“…Henretty et al [11] use a DSL-based approach relying on different versions of split tiling in combined with a data layout transformation to generate efficient SIMD CPU code. Strzodka [22] uses an in-tile wavefront traversal technique to achieve efficient cache use even with tile sizes larger than the available cache memory. All these approaches generate efficient CPU code.…”

Section: Related Work (mentioning)

confidence: 99%

The Relation Between Diamond Tiling and Hexagonal Tiling

Grosser, Verdoolaege, Cohen et al. 2014

Parallel Process. Lett.

Iterative stencil computations are important in scientific computing and more also in the embedded and mobile domain. Recent publications have shown that tiling schemes that ensure concurrent start provide efficient ways to execute these kernels. Diamond tiling and hybrid-hexagonal tiling are two tiling schemes that enable concurrent start. Both have different advantages: diamond tiling has been integrated in a general purpose optimization framework and uses a cost function to choose among tiling hyperplanes, whereas the greater flexibility with tile sizes for hybrid-hexagonal tiling has been exploited for effective generation of GPU code. In this paper we undertake a comparative study of these two tiling approaches and propose a hybrid approach that combines them. We analyze the effects of tile size and wavefront choices on tile-level parallelism, and formulate constraints for optimal diamond tile shapes. We then extend, for the case of two dimensions, the diamond tiling formulation into a hexagonal tiling one, which offers both the flexibility of hexagonal tiling and the generality of the original diamond tiling implementation. We also show how to compute tile sizes that maximize the compute-to-communication ratio, and apply this result to compare the best achievable ratio and the associated synchronization overhead for diamond and hexagonal tiling.
