OpenMP to FPGA Offloading Prototype Using OpenCL SDK

Knaust, Marius; Mayer, Florian; Steinke, Thomas

doi:10.1109/ipdpsw.2019.00072

Cited by 18 publications

(7 citation statements)

References 6 publications

“…ExaHyPE, an Exascale Hyperbolic PDE design [30] used a pragma-based GPU parallelization approach for object-oriented code, and documented lessons learned. Several other related works include demonstrating GPU support for OpenMP offloading features in compilers in Flang/Clang [3,25], a proof-of-concept implementation of offloading for FPGA based accelerators [14,26], and an interprocedural statical analysis heuristic at runtime to select optimal grid sizes for offloaded target team constructs [27], among others. There are publicly available benchmark suites to evaluate heterogeneous application performance, e.g.…”

Section: Related Work (mentioning)

confidence: 99%

Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs

Davis, Daley, Pophale, et al. 2021

Accelerator Programming Using Directives

Heterogeneous systems are becoming increasingly prevalent. In order to exploit the rich compute resources of such systems, robust programming models are needed for application developers to seamlessly migrate legacy code from today's systems to tomorrow's. Over the past decade and more, directives have been established as one of the promising paths to tackle programmatic challenges on emerging systems. This work focuses on applying and demonstrating OpenMP offloading directives on five proxy applications. We observe that the performance varies widely from one compiler to the other; a crucial aspect of our work is reporting best practices to application developers who use OpenMP offloading compilers. While some issues can be worked around by the developer, there are other issues that must be reported to the compiler vendors. By restructuring OpenMP offloading directives, we gain an 18x speedup for the su3 proxy application on NERSC's Cori system when using the Clang compiler, and a 15.7x speedup by switching max reductions to add reductions in the laplace mini-app when using the Cray-llvm compiler on Cori.

“…Knaust et al [29] use Clang [30] to outline omp target regions at the level of the LLVM IR, and feed them into Intel's OpenCL HLS tool-chain to generate a hardware kernel for the FPGA. Their approach uses Intel's OpenCL API to allow the communication between host and FPGA.…”

Section: Resource Utilization (mentioning)

confidence: 99%
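For readers unfamiliar with the construct named in the statement above, here is a minimal sketch of an `omp target` region of the kind a compiler would outline for offloading. It is not code from Knaust et al.; the `vec_add` function, its parameters, and the choice of a vector-add kernel are hypothetical and chosen only for illustration. Only standard OpenMP `target` and `map` syntax is used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel (not from the cited paper): element-wise vector add.
 * The loop under "#pragma omp target" is the region an offloading compiler
 * would outline into a separate device function. */
void vec_add(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    /* map(to: ...) copies the inputs to the device before the region runs;
     * map(from: ...) copies the result back to the host afterwards. */
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Built without OpenMP support, the pragma is ignored and the loop runs on the host. With an offloading-capable compiler, the same source is a candidate for execution on an accelerator, and a conforming runtime falls back to the host when no device is available.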

Enabling OpenMP Task Parallelism on Multi-FPGAs

Nepomuceno, Sterle, Valarini, et al. 2021

2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

FPGA-based hardware accelerators have received increasing attention mainly due to their ability to accelerate deep pipelined applications, thus resulting in higher computational performance and energy efficiency. Nevertheless, the amount of resources available on even the most powerful FPGA is still not enough to speed up very large modern workloads. To achieve that, FPGAs need to be interconnected in a Multi-FPGA architecture capable of accelerating a single application. However, programming such architecture is a challenging endeavor that still requires additional research. This paper extends the OpenMP task-based computation offloading model to enable a number of FPGAs to work together as a single Multi-FPGA architecture. Experimental results for a set of OpenMP stencil applications running on a Multi-FPGA platform consisting of 6 Xilinx VC709 boards interconnected through fiber-optic links have shown close to linear speedups as the number of FPGAs and IP-cores per FPGA increase.

“…This typically leads to very high compile times and very low FPGA occupation and performance, since CPU- and GPU-optimized code is notably inefficient in the FPGA architectures. Further work by Knaust [13] and Huthmann [14] attack this problem in different ways. The first one opts to prototype the FPGA device with OpenCL and compiler-specific interfaces, requiring IR (Intermediate Representation) backporting to make use of the HLS system and OpenCL interfaces.…”

Section: Related Work (mentioning)

confidence: 99%

“…More flexible than [13,14] is the aforementioned Yviquel et al [10] work. It does not generate target binary code but rather a Scala implementation (as a Java runtime binary) to be ran on any Apache Spark cluster.…”

Section: Related Work (mentioning)

confidence: 99%

FOTV: A Generic Device Offloading Framework for OpenMP

Vázquez, Sánchez 2021

OpenMP: Enabling Massive Node-Level Parallelism

Since the introduction of the “target” directive in the 4.0 specification, the usage of OpenMP for heterogeneous computing programming has increased significantly. However, the compiler support limits its usage because the code for the accelerated region has to be generated in compile time. This restricts the usage of accelerator-specific design flows (e.g. FPGA hardware synthesis) and the support of new devices that typically requires extending and modifying the compiler itself. This paper explores a solution to these limitations: a generic device that is supported by the OpenMP compiler but whose functionality is defined at runtime. The generic device framework has been integrated in an OpenMP compiler (LLVM/Clang). It acts as a device type for the compiler and interfaces with the physical devices to execute the accelerated code. The framework has an API that provides support for new devices and accelerated code without additional OpenMP compiler modifications. It also includes a code generator that extracts the source code of OpenMP target regions for external compilation chains. In order to evaluate the approach, we present a new device implementation that allows executing OpenCL code as an OpenMP target region. We study the overhead that the framework produces and show that it is minimal and comparable to other OpenMP devices.
