A Case Study of Porting HPGMG from CUDA to OpenMP Target Offload
2020
DOI: 10.1007/978-3-030-58144-2_3

Cited by 12 publications (5 citation statements) · References 25 publications
“…A detailed analysis of OpenMP 4.5 support in different compilers showed runtime overheads during the testing of different features [28]. More recently [11], three compilers supporting OpenMP offloading directives were tested on discrete GPUs; runtime overheads in LLVM/Clang were identified, along with suggestions to manually implement acc_attach to create data structures on the device and establish the association between host and device addresses. As with the HIP/CUDA programming models, data management challenges in directive-based programming have been one of the major hurdles in extending the applicability of OpenMP GPU offloading from benchmarks to full-scale production codes.…”
Section: Related Work (mentioning)
confidence: 99%
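
The manual pointer-attachment workaround mentioned in this statement can be sketched in a few lines of C. The fragment below is illustrative only (vec_t and vec_attach are hypothetical names, not taken from the cited papers): it maps a struct shell and its payload separately, then patches the pointer inside the device copy of the struct so that device code dereferencing it reaches the payload's device address, mimicking OpenACC's acc_attach.

    #include <omp.h>

    /* Hypothetical container type, for illustration only. */
    typedef struct {
      double *data;   /* host pointer to the payload */
      int     n;
    } vec_t;

    void vec_attach(vec_t *v)
    {
      /* Map the struct shell and the payload as separate allocations;
       * in OpenMP 4.5 the pointer inside the device copy of the struct
       * is not guaranteed to be attached automatically. */
      #pragma omp target enter data map(to: v[0:1])
      #pragma omp target enter data map(to: v->data[0:v->n])

      double *payload = v->data;
      /* use_device_ptr translates the payload's host address to its
       * device counterpart ... */
      #pragma omp target data use_device_ptr(payload)
      {
        /* ... and a tiny target region stores that device address into
         * the device copy of the struct; map(alloc:) reuses the
         * already-mapped shell without transferring data. */
        #pragma omp target is_device_ptr(payload) map(alloc: v[0:1])
        { v->data = payload; }
      }
    }

After such a call, target regions that use the mapped struct can dereference v->data safely; OpenMP 5.0 later formalized this behavior with explicit pointer-attachment rules.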
“…In more recent releases [9], new features for managing memory on heterogeneous systems have been added, with full support for accelerator devices. Increasing compiler support and optimization have enabled numerous case studies and user-experience reports on OpenMP target offloading of in-house applications [10], mini-apps [11], and benchmarks [12]. However, the simplicity of the example codes presented in these case studies often creates a challenge when translating and implementing OpenMP target offloading in production-ready applications.…”
Section: Introduction (mentioning)
confidence: 99%
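
As a concrete illustration of the device memory-management routines alluded to in the first sentence of this statement, the following minimal C sketch (assuming a system with one offload device; it is not code from the cited papers) allocates device memory through the OpenMP runtime API, copies data in, runs a target loop, and copies the result back:

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
      const int n = 1024;
      int dev  = omp_get_default_device();
      int host = omp_get_initial_device();

      double *h = malloc(n * sizeof *h);
      for (int i = 0; i < n; ++i) h[i] = (double)i;

      /* Raw device allocation, outside the mapped data environment. */
      double *d = omp_target_alloc(n * sizeof *d, dev);

      /* omp_target_memcpy(dst, src, bytes, dst_offset, src_offset,
       *                   dst_device, src_device) */
      omp_target_memcpy(d, h, n * sizeof *d, 0, 0, dev, host);

      /* d is already a device address, so mark it is_device_ptr. */
      #pragma omp target teams distribute parallel for is_device_ptr(d)
      for (int i = 0; i < n; ++i) d[i] *= 2.0;

      omp_target_memcpy(h, d, n * sizeof *h, 0, 0, host, dev);

      printf("h[10] = %f\n", h[10]);   /* expect 20.0 */
      omp_target_free(d, dev);
      free(h);
      return 0;
    }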
“…However, research efforts have focused on understanding how OpenMP can be used in a less architecture-specific manner to improve portability [10]. Performance gaps between optimized CUDA code and directive-based versions have been narrowing in recent years, for both OpenACC and OpenMP [2], [3], [20], paving the way for more large applications to invest time in supporting a directive-based version, or to choose a directive-based offloading strategy as the primary programming model for GPU support, as was recently done for the widely-used materials program VASP and the polarizable molecular dynamics program Tinker-HP [1].…”
Section: B. Directives for GPU Offloading (mentioning)
confidence: 99%
“…Numerous studies over the past decade have focused on directive-based offloading as a solution for performance portability [3], [5], [9], [10], [12], [13], [21]. Multiple reports have compared the OpenACC and OpenMP approaches and explored differences in usage and performance on GPUs relative to CUDA, as well as for CPU-based threading and accelerators such as the Xeon Phi [2], [5], [10], [12], using simplified kernels and mini-apps.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, it has recently been extended with improved offloading functionality that allows the compiler to offload certain parts of an application to accelerators such as GPUs and FPGAs. Consequently, OpenMP can now target both CPUs and GPUs, which offers better portability than vendor-specific approaches such as CUDA [36].…”
Section: OpenMP (mentioning)
confidence: 99%
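
A minimal sketch of that portability claim (saxpy is an illustrative example, not code from the cited work): with an offloading-capable compiler, the same annotated loop runs on a GPU when one is present and falls back to the host otherwise, whereas a CUDA kernel would require an NVIDIA device and toolchain.

    #include <omp.h>
    #include <stdio.h>

    void saxpy(int n, float a, const float *x, float *y)
    {
      /* One directive serves both CPU and GPU builds; the map clauses
       * only take effect when the loop is actually offloaded. */
      #pragma omp target teams distribute parallel for \
              map(to: x[0:n]) map(tofrom: y[0:n])
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
      enum { N = 1000 };
      float x[N], y[N];
      for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

      saxpy(N, 3.0f, x, y);

      printf("y[0] = %.1f (offload devices: %d)\n",
             y[0], omp_get_num_devices());   /* expect y[0] = 5.0 */
      return 0;
    }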