An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations

Pananilath, Irshad; Acharya, Aravind; Vasista, Vinay; Bondhugula, Uday

doi:10.1145/2739047

Cited by 16 publications

(10 citation statements)

References 29 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…There has been a great amount of work [11,13,22,25,33,34] reported on the evaluation of diamond tiling. It was also generalized to handle iterated stencils defined over periodic data domains with index set splitting [5] and the Lattice-Boltzmann method [26]. Unlike overlapped tiling and split tiling, diamond tiling may work with arbitrary affine dependences.…”

Section: Related Workmentioning

confidence: 99%

Flextended Tiles

Zhao

Cohen²

2019

ACM Trans. Archit. Code Optim.

View full text Add to dashboard Cite

Loop tiling to exploit data locality and parallelism plays an essential role in a variety of general-purpose and domain-specific compilers. Affine transformations in polyhedral frameworks implement classical forms of rectangular and parallelogram tiling, but these lead to pipelined start with rather inefficient wavefront parallelism. Multiple extensions to polyhedral compilers evaluated sophisticated shapes such as trapezoid or diamond tiles, enabling concurrent start along the axes of the iteration space; yet these resort to custom schedulers and code generators insufficiently integrated within the general framework. One of these modified shapes referred to as overlapped tiling also lacks a unifying framework to reason about its composition with affine transformations; this prevents its application in general-purpose loop-nest optimizers and the fair comparison with other techniques. We revisit overlapped tiling, recasting it as an affine transformation on schedule trees composable with any affine scheduling algorithm. We demonstrate how to derive tighter tile shapes with less redundant computations. Our method models the traditional "scalene trapezoid" shapes and novel "right-rectangle" variants. It goes beyond the state of the art by avoiding the restriction to a domain-specific language or introducing post-pass rescheduling and custom code generation. We conduct experiments on the PolyMage benchmarks and iterated stencils, validating the effectiveness and applicability of our technique on both general-purpose multicores and GPU accelerators. CCS Concepts: • Software and its engineering → Compilers;

show abstract

Section: Related Workmentioning

confidence: 99%

Flextended Tiles

Zhao

Cohen²

2019

ACM Trans. Archit. Code Optim.

View full text Add to dashboard Cite

show abstract

“…The stencil benchmarks were optimized for cache locality using the Pluto heuristic [Pluto 2008]. The unsharp-mask and harris-corner kernels were taken from PolyMage [Mullapudi et al 2015] while the LBM benchmarks are due to the work of [Pananilath et al 2015]. Note that in all cases where we perform tiling, tile sizes Table I.…”

Section: Implementation and Practical Impactmentioning

confidence: 99%

Automatic Storage Optimization for Arrays

Bhaskaracharya

Bondhugula

Cohen

2016

ACM Trans. Program. Lang. Syst.

Self Cite

View full text Add to dashboard Cite

Efficient memory allocation is crucial for data-intensive applications as a smaller memory footprint ensures better cache performance and allows one to run a larger problem size given a fixed amount of main memory. In this paper, we describe a new automatic storage optimization technique to minimize the dimensionality and storage requirements of arrays used in sequences of loop nests with a predetermined schedule. We formulate the problem of intra-array storage optimization as one of finding the right storage partitioning hyperplanes: each storage partition corresponds to a single storage location. Our heuristic is driven by a dual objective function that minimizes both, the dimensionality of the mapping and the extents along those dimensions. The technique is dimension optimal for most codes encountered in practice. The storage requirements of the mappings obtained also are asymptotically better than those obtained by any existing schedule-dependent technique. Storage reduction factors and other results we report from an implementation of our technique demonstrate its effectiveness on several real-world examples drawn from the domains of image processing, stencil computations, high-performance computing, and the class of tiled codes in general.

show abstract

“…Their results show excellent weak and strong scalability with 8192 CPU cores. Pananilath I. et al developed an automated code generator for LBM simulations, featuring optimization techniques of tiling, load balancing, SIMD, etc. to boost LBM codes' performance.…”

Section: Related Workmentioning

confidence: 99%

Parallelizing and optimizing large‐scale 3D multi‐phase flow simulations on the Tianhe‐2 supercomputer

Wang

et al. 2015

Concurrency and Computation

View full text Add to dashboard Cite

Summary The lattice Boltzmann method (LBM) is a widely used computational fluid dynamics method for flow problems with complex geometries and various boundary conditions. Large‐scale LBM simulations with increasing resolution and extending temporal range require massive https://en.wikipedia.org/wiki/High-performance_computing#High-performance computing (HPC) resources, thus motivating us to port it onto modern many‐core heterogeneous supercomputers like Tianhe‐2. Although many‐core accelerators such as graphics processing unit and Intel MIC have a dramatic advantage of floating‐point performance and power efficiency over CPUs, they also pose a tough challenge to parallelize and optimize computational fluid dynamics codes on large‐scale heterogeneous system. In this paper, we parallelize and optimize the open source 3D multi‐phase LBM code openlbmflow on the Intel Xeon Phi (MIC) accelerated Tianhe‐2 supercomputer using a hybrid and heterogeneous MPI+OpenMP+Offload+single instruction, mulitple data (SIMD) programming model. With cache blocking and SIMD‐friendly data structure transformation, we dramatically improve the SIMD and cache efficiency for the single‐thread performance on both CPU and Phi, achieving a speedup of 7.9X and 8.8X, respectively, compared with the baseline code. To collaborate CPUs and Phi processors efficiently, we propose a load‐balance scheme to distribute workloads among intra‐node two CPUs and three Phi processors and use an asynchronous model to overlap the collaborative computation and communication as far as possible. The collaborative approach with two CPUs and three Phi processors improves the performance by around 3.2X compared with the CPU‐only approach. Scalability tests show that openlbmflow can achieve a parallel efficiency of about 60% on 2048 nodes, with about 400K cores in total. To the best of our knowledge, this is the largest scale CPU‐MIC collaborative LBM simulation for 3D multi‐phase flow problems. Copyright © 2015 John Wiley & Sons, Ltd.

show abstract

An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations

Cited by 16 publications

References 29 publications

Flextended Tiles

Flextended Tiles

Automatic Storage Optimization for Arrays

Parallelizing and optimizing large‐scale 3D multi‐phase flow simulations on the Tianhe‐2 supercomputer

Contact Info

Product

Resources

About