Understanding stencil code performance on multicore architectures

Rahman, Shah Mohammad Faizur; Yi, Qing; Qasem, Apan

doi:10.1145/2016604.2016641

Cited by 28 publications

(30 citation statements)

References 27 publications

“…[10] combine the cache misses of only the functions that contribute significantly towards the total cache misses. The profiler TAU (Tuning and Analysis Utilities) [21] was used to obtain the PAPI (Performance Application Programming Interface) counters like PAPI_L1_DCM and PAPI_L2_DCM [14]. Table VII shows that the Z decomposition is the worst, with maximum predicted and actual cache misses.…”

Section: Results (mentioning)

confidence: 99%

“…Performance optimization can start with domain decomposition at the macro-level. Figure 4 illustrates that traditional optimizations only consider reducing the cache misses [9] after performing domain decomposition [10], [11], [12], [13], [14]. We take a reverse approach in the sense that we derive a domain decomposition based on optimization of cache-misses.…”

Section: Or the Finite Element Methods (FEM) (mentioning)

confidence: 99%

“…Factors like Translation Lookaside Buffer (TLB) misses, mispredicted branches and hardware prefetches, etc. have also been used to predict the stencil code performance using statistics from performance counters [14]. Cache-aware [10], [11], [12] and Cache Oblivious/transcendental [19] algorithms form an orthogonal approach, with the former taking into account the architectural details of the cache memory hierarchy and the latter completely ignoring them.…”

Section: Or the Finite Element Methods (FEM) (mentioning)

confidence: 99%

A cache-aware approach to domain decomposition for stencil-based codes

Saxena

Jimack

Walkley

2016

2016 International Conference on High Performance Computing & Simulation (HPCS)

Abstract: Partial Differential Equations (PDEs) lie at the heart of numerous scientific simulations depicting physical phenomena. The parallelization of such simulations introduces additional performance penalties in the form of local and global synchronization among cooperating processes. Domain decomposition partitions the largest shareable data structures into sub-domains and attempts to achieve perfect load balance and minimal communication. Up to now, research efforts to optimize spatial and temporal cache reuse for stencil-based PDE discretizations (e.g. finite difference and finite element) have considered sub-domain operations after the domain decomposition has been determined. We derive a cache-oblivious heuristic that minimizes cache misses at the sub-domain level through a quasi-cache-directed analysis to predict families of high performance domain decompositions in structured 3-D grids. To the best of our knowledge, this is the first work to optimize domain decompositions by analyzing cache misses, thus connecting single core parameters (i.e. cache misses) to true multicore parameters (i.e. domain decomposition). We analyze the trade-offs in decreasing cache misses through such decompositions and increasing the dynamic bandwidth-per-core. The limitation of our work is that currently it is applicable only to structured 3-D grids with cuts parallel to the Cartesian axes. We emphasize and conclude that there is an imperative need to re-think domain decompositions in this constantly evolving multicore era.

“…To illustrate the benefits of CMS, we focus on stencil algorithms because of their broad applicability, the memory bandwidth sensitivity of their kernels [36,18,12,1], and their ubiquitous usage [55]. In particular, stencil algorithms constitute a large fraction of consumer, embedded, HPC and scientific applications in such diverse areas as image processing, seismic imaging [46], heat diffusion, electromagnetics, fluid dynamics, and climate modeling [51,52,78,56]. These applications often use iterative finite-difference techniques, which sweep over a spatial grid, performing nearest neighbor computations called stencils.…”

Section: Stencil Computations (mentioning)

confidence: 99%

“…In a stencil operation, each point in a multi-dimensional grid is updated with weighted contributions from a subset of its neighbors in both time and space, thereby representing the coefficients of the partial differential equation (PDE) for that data element. Stencil sizes range from considering only its immediate neighbors to 9-, 13-, 21- and 27-point stencils [14,11,78,56]. Stencil calculations perform global sweeps through data structures that are typically much larger than the available data caches.…”

Section: Stencil Computations (mentioning)

confidence: 99%
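
The stencil sweep described in the statement above can be sketched in a few lines. This is only an illustration under stated assumptions: a 2-D grid, a 5-point stencil, fixed boundary values, and arbitrary weights. The names `jacobi_sweep`, `center_w` and `neighbor_w` are hypothetical and are not taken from any of the cited papers.

```python
def jacobi_sweep(grid, center_w=0.0, neighbor_w=0.25):
    """Apply one 5-point stencil sweep to the interior of a 2-D grid.

    Each interior point becomes a weighted sum of itself and its four
    nearest neighbors. Boundary points are copied unchanged.
    The weights are illustrative assumptions, not values from the source.
    """
    rows, cols = len(grid), len(grid[0])
    # Write into a separate output grid so every update reads old values only.
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            neighbors = (grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1])
            new[i][j] = center_w * grid[i][j] + neighbor_w * neighbors
    return new


if __name__ == "__main__":
    # 3x3 grid: boundary held at 1.0, single interior point starts at 0.0.
    g = [[1.0, 1.0, 1.0],
         [1.0, 0.0, 1.0],
         [1.0, 1.0, 1.0]]
    print(jacobi_sweep(g)[1][1])
```

Every sweep reads the whole input grid and writes the whole output grid, which is why, as the statement notes, grids much larger than the data caches make such kernels sensitive to memory traffic.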

Collective Memory Transfers for Multi-Core Chips

Williams

Shalf

2013

TOAST: Automatic tiling for iterative stencil computations on GPUs

Rocha

Pereira

Ramos

et al. 2017

Concurrency and Computation

Summary: The stencil pattern is important in many scientific and engineering domains, spurring great interest from researchers and industry. In recent years, various optimizations have been proposed for parallel stencil applications running on graphics processing units (GPUs). In particular, tiling is a technique that can significantly enhance application performance by improving data locality and by reducing the volume of communication between host memory and GPU. In addition, tiling enables stencil applications to process inputs that are larger than the physical GPU memory. However, implementing tiling efficiently is complex, time‐consuming, and error‐prone. In this paper, we propose transparently optimized automatic stencil tiling (TOAST), an automatic tiling mechanism for iterative stencil computations running on GPUs. TOAST has 3 main benefits: (1) it incorporates an optimization model that seeks to maximize data reuse within tiles while respecting the amount of dynamically available GPU memory; (2) it offers a virtualized GPU memory for stencil computations, allowing for large input data; and (3) it performs optimal tiling transparently to the developer of the parallel stencil application. The current implementation of TOAST augments the PSkel framework with an internal solver based on genetic algorithms. Our experimental results show that TOAST improves the performance of iterative stencil applications by up to 13× compared with their multithreaded (central processing unit–based) optimized versions and up to 48× compared with a naive tiling approach on GPU. The TOAST mechanism is able to automatically achieve a low percentage overhead of data management compared with actual stencil computation.
