Loop Chaining: A Programming Abstraction for Balancing Locality and Parallelism

Krieger, Christopher D.; Strout, Michelle Mills; Olschanowsky, Catherine; Stone, Andrew; Guzik, Stephen M.; Gao, Xinfeng; Bertolli, Carlo; Kelly, Paul H. J.; Mudalige, Gihan R.; Straalen, Brian Van; Williams, Samuel

doi:10.1109/ipdpsw.2013.68

Cited by 15 publications

(13 citation statements)

References 41 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The performance data on bandwidths in Tables 4 and 5 show that there is only little scope for additional speedup. One avenue we are currently investigating for obtaining further performance is to implement sparse/overlapped tiling [63,64]. The idea is to employ communications avoiding algorithms such that we reduce the bottleneck due to memory bandwidth.…”

Section: Single Node Performancementioning

confidence: 99%

Design and initial performance of a high-level unstructured mesh framework on heterogeneous parallel systems

et al. 2013

View full text Add to dashboard Cite

Section: Single Node Performancementioning

confidence: 99%

Design and initial performance of a high-level unstructured mesh framework on heterogeneous parallel systems

et al. 2013

View full text Add to dashboard Cite

“…The loop chain is an abstraction introduced in [21]. Informally, a loop chain is a sequence of loops with no global synchronization points, enriched with information to enable run-time data dependence analysis -necessary since indirect memory accesses inhibit common static approaches to loop optimization.…”

Section: The Loop Chain Abstraction For Unstructured Mesh Applicationsmentioning

confidence: 99%

“…The data dependence analysis that we have developed in this article is based on the loop chain abstraction, which was originally presented in [21]. This abstraction is su ciently general to capture data dependencies in programs structured as arbitrary sequences of loops, particularly to create inspector/executor schemes for many unstructured mesh application.…”

Section: Related Workmentioning

confidence: 99%

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modeling

Luporini

Lange

Jacobs

et al. 2019

ACM Trans. Math. Softw.

View full text Add to dashboard Cite

Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have di erent iteration spaces and access shared datasets through indirect memory accesses, such as A[map [i]] -hence the name "sparse". One notable example of such loops arises in discontinuous-Galerkin nite element methods, because of the computation of numerical integrals over di erent domains (e.g., cells, facets). The major challenge with sparse tiling is implementationnot only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical speci cation of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing: a revisited, more e cient sparse tiling algorithm; support for distributed-memory parallelism; a range of ne-grained optimizations for increased run-time performance; implementation in a publicly-available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.(or equivalent data structure), which leads to indirect memory accesses within the loop nests. Indirections break static analysis, thus making purely compiler-based approaches insu cient. Runtime data dependence analysis is essential for sparse tiling, so integration of compiler and run-time tracking algorithms becomes necessary. Realistic datasets not tting in a single node Real-world simulations often operate on terabytes of data, hence execution on multi-node systems is often required. We have extended the original sparse tiling algorithm to enable distributed-memory parallelism.Sparse tiling does not change the semantics of a numerical method -only the order in which some iterations are executed. Therefore, if most sections of a PDE solver su er from computational boundedness and standard optimizations such as vectorization have already been applied, then sparse tiling, which targets memory-boundedness, will only provide marginal bene ts (if any). Likewise, if a global reduction is present in between two loops, then there is no way for sparse tiling to be applied, unless the numerical method itself is rethought. This is regardless of whether the reduction is explicit (e.g., the rst loop updates a global variable that is read by the second loop) or implicit (i.e., within an external function, as occurs for example in most implicit nite element solvers). These are probably the two greatest limitations of the technique; otherwise, sparse tiling may provide substantial performance bene ts.The rest of the article is structured a...

show abstract

“…x y x y block halo halo area of dataset 1 Fig. 1: Halos in a multi-block setting introduces a lazy execution scheme used by OPS, and describes a number of optimizations that rely on the loop chaining abstraction [22]. Section IV gives an overview of the design choices of OPS discussing their benefits and drawbacks, and finally Section VI draws conclusion.…”

Section: Dataset 1 On Blockmentioning

confidence: 99%

The OPS Domain Specific Abstraction for Multi-block Structured Grid Computations

Reguly

Mudalige

Giles

et al. 2014

2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing

View full text Add to dashboard Cite

Code maintainability, performance portability and future proofing are some of the key challenges in this era of rapid change in High Performance Computing. Domain Specific Languages and Active Libraries address these challenges by focusing on a single application domain and providing a high-level programming approach, and then subsequently using domain knowledge to deliver high performance on various hardware.In this paper, we introduce the OPS high-level abstraction and active library aimed at multi-block structured grid computations, and discuss some of its key design points; we demonstrate how OPS can be embedded in C/C++ and the API made to look like a traditional library, and how through a combination of simple text manipulation and back-end logic we can enable execution on a diverse range of hardware using different parallel programming approaches.Relying on the access-execute description of the OPS abstraction, we introduce a number of automated execution techniques that enable distributed memory parallelization, optimization of communication patterns, checkpointing and cache-blocking. Using performance results from CloverLeaf from the Mantevo suite of benchmarks, we demonstrate the utility of OPS.

show abstract

Loop Chaining: A Programming Abstraction for Balancing Locality and Parallelism

Cited by 15 publications

References 41 publications

Design and initial performance of a high-level unstructured mesh framework on heterogeneous parallel systems

Design and initial performance of a high-level unstructured mesh framework on heterogeneous parallel systems

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modeling

The OPS Domain Specific Abstraction for Multi-block Structured Grid Computations

Contact Info

Product

Resources

About