Tessellating stencils

Yuan, Liang; Zhang, Yunquan; Guo, Peng; Huang, Shan

doi:10.1145/3126908.3126920

Cited by 20 publications

(38 citation statements)

References 68 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Most of these references rely on SB with an implementation based on the BSP model. To increase data locality, TB through tiling techniques (Bandishti et al, 2012;Grosser et al, 2014b;Malas et al, 2015;Orozco and Gao, 2009;Strzodka et al, 2011;Wellein et al, 2009;Wonnacott, 2000;Yuan et al, 2017;Zhou, 2013) has been widely considered using various advanced programming models to favor asynchronous execution. Performance tuning using roofline models (Datta, 2009;Etienne et al, 2017;Nguyen et al, 2010;Titarenko and Hildyard, 2017) remains an important assessment step for stencil computations to ensure a good utilization of the underlying hardware resources.…”

Section: Prior Work and Current Contributionsmentioning

confidence: 99%

See 1 more Smart Citation

Asynchronous computations for solving the acoustic wave propagation equation

Akbudak

Ltaief

Etienne

et al. 2020

The International Journal of High Performance Computing Applica

View full text Add to dashboard Cite

The aim of this study is to design and implement an asynchronous computational scheme for solving the acoustic wave propagation equation with absorbing boundary conditions (ABCs) in the context of seismic imaging applications. While the convolutional perfectly matched layer (CPML) is typically used for ABCs in the oil and gas industry, its formulation further stresses memory accesses and decreases the arithmetic intensity at the physical domain boundaries. The challenges with CPML are twofold: (1) the strong, inherent data dependencies imposed on the explicit time-stepping scheme render asynchronous time integration cumbersome and (2) the idle time is further exacerbated by the load imbalance introduced among processing units. In fact, the CPML formulation of the ABCs requires expensive synchronization points, which may hinder the parallel performance of the overall asynchronous time integration. In particular, when deployed in conjunction with the multicore-optimized wavefront diamond temporal blocking (MWD-TB) approach for the inner domain points, it results in a major performance slow down. To relax CPML’s synchrony and mitigate the resulting load imbalance, we embed CPML’s calculation into MWD-TB’s inner loop and carry on the time integration with fine-grained computations in an asynchronous, holistic way. This comes at the price of storing transient results to alleviate dependencies from critical data hazards while maintaining the numerical accuracy of the original scheme. Performance and scalability results on various x86 architectures demonstrate the superiority of MWD-TB with CPML support against the standard spatial blocking on various grid sizes. To our knowledge, this is the first practical study that highlights the consolidation of CPML ABCs with asynchronous temporal blocking stencil computations.

show abstract

Section: Prior Work and Current Contributionsmentioning

confidence: 99%

“…Christen et al, 2011;Malas, 2015;Tang et al, 2011). More thorough related work can be found at Malas et al (2017) and Yuan et al (2017).…”

Section: Prior Work and Current Contributionsmentioning

confidence: 99%

Asynchronous computations for solving the acoustic wave propagation equation

Akbudak

Ltaief

Etienne

et al. 2020

The International Journal of High Performance Computing Applica

View full text Add to dashboard Cite

show abstract

“…With the development of advanced vector instruction sets, there have been many research studies addressing the challenges of a faster computational stencil kernels on homogeneous x86 and GPU-based systems using spatial [5]- [15] or temporal blocking [16]- [26]. These stencil kernel optimizations are key components to the Reverse Time Migration (RTM), but they usually rely on simple boundary conditions (e.g., Dirichlet) and do not consider the full RTM ecosystem and specifications.…”

Section: Related Workmentioning

confidence: 99%

Asynchronous Task-Based Execution of the Reverse Time Migration for the Oil and Gas Industry

AlOnazi

Ltaief

Keyes

et al. 2019

2019 IEEE International Conference on Cluster Computing (CLUSTER)

View full text Add to dashboard Cite

We propose a new framework for deploying Reverse Time Migration (RTM) simulations on distributed-memory systems equipped with multiple GPUs. Our software, TB-RTM, infrastructure engine relies on the STARPU dynamic runtime system to orchestrate the asynchronous scheduling of RTM computational tasks on the underlying resources. Besides dealing with the challenging hardware heterogeneity, TB-RTM supports tasks with different workload characteristics, which stress disparate components of the hardware system. RTM is challenging in that it operates intensively at both ends of the memory hierarchy, with compute kernels running at the highest level of the memory system, possibly in GPU main memory, while I/O kernels are saving solution data to fast storage. We consider how to span the wide performance gap between the two extreme ends of the memory system, i.e., GPU memory and fast storage, on which large-scale RTM simulations routinely execute. To maximize hardware occupancy while maintaining high memory bandwidth throughout the memory subsystem, our framework presents the new out-of-core (OOC) feature from STARPU to prefetch data solutions in and out not only from/to the GPU/CPU main memory but also from/to the fast storage system. The OOC technique may trigger opportunities for overlapping expensive data movement with computations. TB-RTM framework addresses this challenging problem of heterogeneity with a systematic approach that is oblivious to the targeted hardware architectures. Our resulting RTM framework can effectively be deployed on massively parallel GPU-based systems, while delivering performance scalability up to 500 GPUs.

show abstract

“…It is extensively involved in various domains from physical simulations to machine learning [8,26,36]. Stencil is also included as one of the seven computational motifs presented in the Berkeley View [3,4,51] and arises as a principal class of floating-point kernels in high-performance computing.…”

Section: Introductionmentioning

confidence: 99%

“…The naive implementation for a 𝑑-dimensional stencil contains 𝑑 +1 loops where the time dimension is traversed in the outmost loop and all grid points are updated in inner loops. Since stencil is characterized by this regular computational structure, it is inherently a bandwidth-bound kernel with a low arithmetic intensity and poor data reuse [24,51].…”

Section: Introductionmentioning

confidence: 99%

Reducing Redundancy in Data Organization and Arithmetic Calculation for Stencil Computations

Li¹,

Yuan²,

Zhang³

et al. 2021

Preprint

Self Cite

View full text Add to dashboard Cite

Stencil computation is one of the most important kernels in various scientific and engineering applications. A variety of work has focused on vectorization techniques, aiming at exploiting the in-core data parallelism. Briefly, they either incur data alignment conflicts or hurt the data locality when integrated with tiling. In this paper, a novel transpose layout is devised to preserve the data locality for tiling in the data space and reduce the data reorganization overhead for vectorization simultaneously. We then propose an approach of temporal computation folding designed to further reduce the redundancy of arithmetic calculations by exploiting the register reuse, alleviating the increased register pressure, and deducing generalization with a linear regression model. Experimental results on the AVX-2 and AVX-512 CPUs show that our approach obtains a competitive performance.

show abstract

Tessellating stencils

Cited by 20 publications

References 68 publications

Asynchronous computations for solving the acoustic wave propagation equation

Asynchronous computations for solving the acoustic wave propagation equation

Asynchronous Task-Based Execution of the Reverse Time Migration for the Oil and Gas Industry

Reducing Redundancy in Data Organization and Arithmetic Calculation for Stencil Computations

Contact Info

Product

Resources

About