2013
DOI: 10.1016/j.jpdc.2012.07.008

Designing OP2 for GPU architectures

Abstract: OP2 is an "active" library framework for the solution of unstructured mesh applications. It aims to decouple the specification of a scientific application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents the design of the current OP2 library for generating efficient code targeting contemporary GPU platforms. In this we focus on some of the software architecture design choices…
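The decoupling described in the abstract is exposed through a small declarative host API: the application declares sets, mappings and data, and expresses computation as parallel loops over sets, while OP2 generates the back-end-specific implementation. Below is a minimal sketch in the style of OP2's C/C++ host API (op_decl_set, op_decl_map, op_decl_dat, op_par_loop); the edge/cell mesh, the kernel body and all variable names are illustrative and not taken from the paper.

// Minimal OP2-style specification sketch; mesh sizes, names and the kernel
// body are illustrative. The op_* calls follow OP2's published host API.
#include "op_seq.h"           // OP2 developer (single-threaded) back-end

// User kernel: works on one edge and the two cells it connects. OP2's code
// generator can re-target loops over this kernel to CUDA, OpenMP or MPI
// back-ends without changes to the specification.
static inline void res_calc(const double *flux, double *cell0, double *cell1) {
  cell0[0] += flux[0];
  cell1[0] -= flux[0];
}

int main(int argc, char **argv) {
  op_init(argc, argv, 2);

  // In a real application these come from a mesh file.
  int nedges = 4, ncells = 4;
  int edge_to_cell[] = {0,1, 1,2, 2,3, 3,0};   // 2 cells per edge
  double flux[4] = {0}, residual[4] = {0};

  // Declare the mesh: sets, the mapping between them, and data on the sets.
  op_set edges  = op_decl_set(nedges, "edges");
  op_set cells  = op_decl_set(ncells, "cells");
  op_map e2c    = op_decl_map(edges, cells, 2, edge_to_cell, "e2c");
  op_dat d_flux = op_decl_dat(edges, 1, "double", flux, "flux");
  op_dat d_res  = op_decl_dat(cells, 1, "double", residual, "res");

  // Parallel loop over edges: OP2 chooses colouring, scheduling and data
  // layout for the selected back-end; the user states only what is computed.
  op_par_loop(res_calc, "res_calc", edges,
              op_arg_dat(d_flux, -1, OP_ID, 1, "double", OP_READ),
              op_arg_dat(d_res,   0, e2c,   1, "double", OP_INC),
              op_arg_dat(d_res,   1, e2c,   1, "double", OP_INC));

  op_exit();
  return 0;
}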

Cited by 28 publications (39 citation statements)
References 10 publications
“…The ability to adapt to the rapidly changing hardware landscape motivated the development of OP2 [8], [9], a successor to OPlus. While the initial motivation was to enable Hydra to exploit multi-core and many-core parallelism, OP2 was designed from the outset to be a general high-level active library framework to express and parallelize unstructured mesh based numerical computations.…”
Section: OP2 Library for Unstructured Grids (mentioning)
Confidence: 99%
“…OP2 holds them internally as C arrays and it is able to apply optimizing transformations in how the data is held in memory. Transformations include reordering mesh elements [16], partitioning (under MPI) and conversion to an array-of-structs data layout (for GPUs [9]). These transformations, and OP2's ability to seamlessly apply them internally is key to achieving a number of performance optimizations.…”
Section: Development and Code Generation with OP2 (mentioning)
Confidence: 99%
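The layout transformation mentioned in the statement above, switching a multi-component dataset between per-component and per-element storage, can be pictured with a short generic sketch. This is not OP2's internal code, only an illustration of the kind of memory-layout change the quoted text refers to; the function name and index conventions are made up for the sketch.

// Illustration only (not OP2 source): converting a dim-component dataset
// from a struct-of-arrays layout, soa[c * nelems + e], to an
// array-of-structs layout, aos[e * dim + c].
#include <cstddef>
#include <vector>

std::vector<double> soa_to_aos(const std::vector<double>& soa,
                               std::size_t nelems, std::size_t dim) {
  std::vector<double> aos(nelems * dim);
  for (std::size_t e = 0; e < nelems; ++e)
    for (std::size_t c = 0; c < dim; ++c)
      aos[e * dim + c] = soa[c * nelems + e];   // transpose element/component order
  return aos;
}

Because user kernels only ever see the pointers handed to them by op_par_loop, such a transformation can be applied internally, per back-end, without any change to application code; this is the "seamless" application the quoted statement highlights.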
“…Results presented in previous papers [4,6] report pure MPI and hybrid MPI+OpenMP performance on clusters of CPUs as well as MPI+CUDA performance results running on Fermi-generation NVIDIA GPUs. However, as newer hardware generations become available, it is necessary to revise optimization techniques due to changing performance characteristics and best practices; for example the Kepler generation of GPUs features a much higher number of cores per Scalar Multiprocessor (SMX) than the Fermi generation, but the amount of shared memory available remains unchanged.…”
Section: Optimizations to Existing Backends (mentioning)
Confidence: 99%
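As a concrete illustration of the resource limits that drive the re-tuning described above, the per-SM figures the statement mentions can be read off at run time with the CUDA runtime API. The following host-side sketch is not from the paper; it only uses standard cudaGetDeviceProperties fields.

// Host-side sketch (not from the paper): query the per-device and per-SM
// resources that re-tuning decisions such as the one described above depend on.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    std::fprintf(stderr, "no CUDA device found\n");
    return 1;
  }
  std::printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
  std::printf("multiprocessors:       %d\n", prop.multiProcessorCount);
  std::printf("shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
  std::printf("shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
  std::printf("registers per block:   %d\n", prop.regsPerBlock);
  return 0;
}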