Early Experiences Porting Three Applications to OpenMP 4.5

Abstract. One of the key requirements for the Lattice QCD Application Development as part of the US Exascale Computing Project is performance portability across multiple architectures. Using the Grid C++ expression template as a starting point, we report on the progress made with regard to the Grid GPU offloading strategies. We present both the successes and issues encountered in using CUDA, OpenACC and Just-In-Time compilation. Experimentation and performance on GPUs with an SU(3)×SU(3) streaming test will be reported. We will also report on the challenges of using current OpenMP 4.x for GPU offloading in the same code.


“…However, OpenMP faces more challenges than OpenACC for C++ code due to the lack of UVM support in the current compiler implementations [6,7].…”

Section: OpenMP 4.5 (mentioning)

confidence: 43%

Performance Portability Strategies for Grid C++ Expression Templates

et al. 2018


“…Starting with version 4.0, OpenMP is capable of offloading computations to GPUs thus raising performance challenges for both on-device computation and host-device communication. Some of the early experiences with OpenMP are outlined by Karlin et al [8] and Vergara Larrea et al [9]. For some time, the OpenMP standard has required explicit handling of data between host and device using maps.…”

Section: Related Work (mentioning)

confidence: 99%

Pointers Inside Lambda Closure Objects in OpenMP Target Offload Regions

Truby

Bertolli

Wright

et al. 2018

2018 IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC)


With the diversification of HPC architectures beyond traditional CPU-based clusters, a number of new frameworks for performance portability across architectures have arisen. One way of implementing such frameworks is to use C++ templates and lambda expressions to design loop-like functions. However, lower-level programming APIs that these implementations must use are often designed with C in mind and do not specify how they interact with C++ features such as lambda expressions. This paper discusses a change to the behavior of the OpenMP specification with respect to lambda expressions such that when functions generated by lambda expressions are called inside GPU regions, any pointers used in the lambda expression correctly refer to device pointers. This change has been implemented in a branch of the Clang C++ compiler and demonstrated with two representative codes. This change has also been accepted into the draft OpenMP® specification for inclusion in OpenMP 5. Our results show that the implicit mapping of lambda expressions always exhibits identical performance to an explicit mapping but without breaking the abstraction provided by the high-level frameworks.
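
As a rough illustration of the pattern this abstract describes, the sketch below shows a loop-like function that takes a lambda and calls it inside a target region. The names `forall` and `scale` are hypothetical and are not taken from the paper or from any particular framework. Whether the pointer captured by the lambda is translated to its device counterpart depends on the compiler and the OpenMP version in use; the sketch assumes either that behavior or a host fallback, where the pragmas are ignored and the code runs serially.

```cpp
#include <cstddef>

// Hypothetical loop abstraction: invokes body(i) for each i in [0, n)
// inside an OpenMP target region.
template <typename Body>
void forall(std::size_t n, Body body) {
    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i) {
        body(i);
    }
}

// Hypothetical user code: the lambda captures the pointer `data` by value.
// The enclosing target data region maps the array itself; the abstract's
// point is that the captured pointer must then refer to the device copy.
void scale(double* data, std::size_t n, double a) {
    #pragma omp target data map(tofrom: data[0:n])
    {
        forall(n, [=](std::size_t i) { data[i] *= a; });
    }
}
```

The user-facing call `scale(data, n, a)` contains no device-specific pointer handling inside the lambda, which is the abstraction the abstract refers to.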


“…An annual hackathon event for the improvement of OpenMP is hosted by IBM, where a live porting exercise is performed involving multiple US labs [11]. As an outcome of the 2015 hackathon, Karlin et al [12] ported the applications Kripke, Cardioid, and LULESH to OpenMP, demonstrating some issues with the interoperability with some features of C++, and achieving performance with LULESH within 10% of an equivalent CUDA port.…”

Section: Related Work (mentioning)

confidence: 99%

The Productivity, Portability and Performance of OpenMP 4.5 for Scientific Applications Targeting Intel CPUs, IBM CPUs, and NVIDIA GPUs

Martineau

McIntosh-Smith

2017

Scaling OpenMP for Exascale Performance and Portability


This research considers the productivity, portability, and performance offered by the OpenMP parallel programming model, from the perspective of scientific applications. We discuss important considerations for scientific application developers tackling large software projects with OpenMP, including straightforward code mechanisms to improve productivity and portability. Performance results are presented across multiple modern HPC devices, including Intel Xeon and Xeon Phi CPUs, POWER8 CPUs, and NVIDIA GPUs. The results are collected for three exemplar applications: hydrodynamics, heat conduction, and neutral particle transport, using modern compilers with OpenMP support. The results show that while current OpenMP implementations are able to achieve good performance on the breadth of modern hardware for memory bandwidth bound applications, our memory latency bound application performs less consistently.


Early Experiences Porting Three Applications to OpenMP 4.5

Cited by 18 publications

References 10 publications
