2018
DOI: 10.1145/3264422

Performance Tuning and Analysis for Stencil-Based Applications on POWER8 Processor

Abstract: This article demonstrates an approach for combining general tuning techniques with the POWER8 hardware architecture through optimizing three representative stencil benchmarks. Two typical real-world applications, with kernels similar to those of the winning programs of the Gordon Bell Prize 2016 and 2017, are employed to illustrate algorithm modifications and a combination of hardware-oriented tuning strategies with the application algorithms. This work fills the gap between hardware capability and software pe…

citations
Cited by 11 publications
(12 citation statements)
references
References 32 publications
“…This poses a fundamental challenge to acceleration. CPU implementations of these kernels [175] suffer from limited data locality and inefficient memory usage, as our roofline analysis in Figure 1 exposes. In Figure 3 we implement a copy stencil from the COSMO weather model to evaluate the performance potential of our HBM-based FPGA platform for the weather prediction application.…”
Section: Flux Results (mentioning)
confidence: 99%
“…Unlike the conventional stencil kernels, vertical advection has dependencies in the vertical direction, which leads to limited available parallelism and irregular memory access patterns. For example, when the input grid is stored by row, accessing data elements in the depth dimension typically results in many cache misses [175].…”
Section: Flux Results (mentioning)
confidence: 99%
“…Figure 1 shows the roofline plot [104] for an IBM 16-core POWER9 CPU (IC922). After optimizing the vadvc and hdiff kernels for the POWER architecture by following the approach in [105], they achieve 29.1 GFLOP/s and 58.5 GFLOP/s, respectively, for 64 threads. Our roofline analysis indicates that these kernels are constrained by the host DRAM bandwidth.…”
Section: Introduction (mentioning)
confidence: 99%