Empowering OpenMP with automatically generated hardware

Podobas, Artur; Brorsson, Mats

doi:10.1109/samos.2016.7818354

Cited by 12 publications

(2 citation statements)

References 18 publications

“…OpenCL [12], OpenMP [38], CUDA [36], and even Java [4]. In this particular study, we use HLS as a method for creating a custom accelerator for the spectral element method.…”

Section: Field-programmable Gate Arrays (mentioning)

confidence: 99%

A High-Fidelity Flow Solver for Unstructured Meshes on Field-Programmable Gate Arrays

Karp, Podobas, Kenter et al. 2021

Preprint

Self Cite


The impending termination of Moore's law motivates the search for new forms of computing to continue the performance scaling we have grown accustomed to. Among the many emerging Post-Moore computing candidates, perhaps none is as salient as the Field-Programmable Gate Array (FPGA), which offers the means of specializing and customizing the hardware to the computation at hand. In this work, we design a custom FPGA-based accelerator for a computational fluid dynamics (CFD) code. Unlike prior work, which often focuses on accelerating small kernels, we target the entire unstructured Poisson solver based on the high-fidelity spectral element method (SEM) used in modern state-of-the-art CFD systems. We model our accelerator using an analytical performance model based on the I/O cost of the algorithm. We empirically evaluate our accelerator on a state-of-the-art Intel Stratix 10 FPGA in terms of performance and power consumption and contrast it against existing solutions on general-purpose processors (CPUs). Finally, we propose a novel data movement-reducing technique where we compute geometric factors on the fly, which yields significant (700+ GFlop/s) single-precision performance and upwards of a 2x reduction in runtime for the local evaluation of the Laplace operator. We end the paper by discussing the challenges and opportunities of using reconfigurable architectures in the future, particularly in the light of emerging (not yet available) technologies.


“…Today, this trend has likely reached its climax where programmers and users are bewildered by an ever-increasing amount of heterogeneous accelerators. Devices such as Field-Programmable Gate Arrays (FPGAs) are starting to get recognition for their high-performance computing capabilities [5, 20, 22], Coarse-Grained Reconfigurable Architectures (CGRAs) and custom Deep-Learning accelerators are becoming commonplace [21], and even alternative computing paradigms such as neuromorphic [24] or quantum systems [11] are emerging. However, among all existing heterogeneous accelerators, none is as ubiquitous as the Graphics Processing Unit (GPU).…”

Section: Introduction (mentioning)

confidence: 99%

Benchmarking the Nvidia GPU Lineage

Svedin, Chien, Chikafa et al. 2021

Proceedings of the 11th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies

Self Cite


For many, Graphics Processing Units (GPUs) provide a source of reliable computing power. Recently, Nvidia introduced its 9th generation HPC-grade GPUs, the Ampere 100, claiming significant performance improvements over previous generations, particularly for AI workloads, as well as introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI benchmarks, and can we expect the A100 to deliver the application improvements we have grown used to with previous GPU generations? In this paper, we benchmark the A100 GPU and compare it to four previous generations of GPUs, with particular focus on empirically quantifying our derived performance expectations and, should those expectations be undelivered, investigating whether the introduced data-movement features can offset any eventual loss in performance. We find that the A100 delivers a smaller performance increase than previous generations for the well-known Rodinia benchmark suite; we show that some of these performance anomalies can be remedied through clever use of the new data-movement features, which we microbenchmark and demonstrate where (and more importantly, how) they should be used.
