KernelGen -- The Design and Implementation of a Next Generation Compiler Platform for Accelerating Numerical Models on GPUs

Mikushin, Dmitry; Likhogrud, Nikolay; Zhang, Eddy Z.; Bergström, Christopher

doi:10.1109/ipdpsw.2014.115

Cited by 17 publications

(9 citation statements)

References 4 publications


“…NVIDIA's Jitify [48] is a library that simplifies the use of CUDA Runtime Compilation (NVRTC). KernelGen [49] is a Fortran/C compiler that automates GPU code generation with polyhedral loop analysis of LLVM IR. Those works present dynamic features such as runtime alias analysis and parameter tuning alongside kernel specialization.…”

Section: Related Work (mentioning)

confidence: 99%

JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization

Matsumura, De Gonzalo, Peña, 2021

Preprint


The rapid development of computing technology has paved the way for directive-based programming models to take a principal role in maintaining software portability of performance-critical applications. Such models require minimal engineering cost to enable computational acceleration on multiple architectures, since programmers only need to add meta information to sequential code. Optimizations for obtaining the best possible efficiency, however, are often challenging. The insertion of directives by the programmer can lead to side effects that limit the compiler optimizations available, which can result in performance degradation. This is exacerbated when targeting multi-GPU systems, as pragmas do not automatically adapt to such systems and require expensive and time-consuming code adjustment by programmers. This paper introduces JACC, an OpenACC runtime framework that enables the dynamic extension of OpenACC programs by serving as a transparent layer between the program and the compiler. We add a versatile code-translation method for multi-device utilization by which manually optimized applications can be distributed automatically while keeping the original code structure and parallelism. We show, in some cases, nearly linear scaling of the kernel-execution part on NVIDIA V100 GPUs. While adaptively using multiple GPUs, the resulting performance improvements amortize the latency of GPU-to-GPU communications.


“…Our implementation currently supports converting OpenMP code to HSTREAMS, CUDA and OpenCL programs. While we do not claim novelty on this as several works on source-to-source translation from OpenMP to CUDA [23], [24], [25], [26] or OpenCL [20], [27] exist, we believe the tool could serve as a useful utility for translating OpenMP programs to exploit multi-stream performance on heterogeneous many-core architectures. Figure 7 depicts our source to source code generator for translating OpenMP code to streamed programs.…”

Section: OpenMP to Streamed Code Generator (mentioning)

confidence: 99%

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures

Zhang, Fang, Yang et al., 2020

IEEE Trans. Parallel Distrib. Syst.


As many-core accelerators keep integrating more processing units, it becomes increasingly difficult for a parallel application to make effective use of all available resources. An effective way to improve hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks, a strategy known as heterogeneous streaming. Achieving effective heterogeneous streaming requires carefully partitioning hardware among tasks and matching the granularity of task parallelism to the resource partition. However, finding the right resource partitioning and task granularity is extremely challenging, because there is a large number of possible solutions and the optimal solution varies across programs and datasets. This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. These results translate to over 93% of the performance delivered by a theoretically perfect predictor.


“…CodeExtractor does a flow analysis to detect all the live-in and live-out dependencies of the region to extract [Mikushin et al 2013]. This pass simplifies the codelet extraction process, since it extracts the region code in its own function.…”

Section: IR Capture and Replay Overview (mentioning)

confidence: 99%

CERE

Castro, Akel, Petit et al., 2015

ACM Trans. Archit. Code Optim.


Chadi Akel, Exascale Computing Research; Eric Petit and Mihail Popov, Université de Versailles Saint-Quentin-en-Yvelines; William Jalby, Exascale Computing Research

This article presents Codelet Extractor and REplayer (CERE), an open-source framework for code isolation. CERE finds and extracts the hotspots of an application as isolated fragments of code, called codelets. Codelets can be modified, compiled, run, and measured independently from the original application. Code isolation reduces benchmarking cost and allows piecewise optimization of an application. Unlike previous approaches, CERE isolates code at the compiler Intermediate Representation (IR) level. Therefore, CERE is language agnostic and supports many input languages such as C, C++, Fortran, and D. CERE automatically detects codelet invocations that have the same performance behavior. It then selects a reduced set of representative codelets and invocations that is much faster to replay yet still accurately captures the original application. In addition, CERE supports recompiling and retargeting the extracted codelets. Therefore, CERE can be used for cross-architecture performance prediction or piecewise code optimization. On the SPEC 2006 FP benchmarks, CERE codelets cover 90.9% and accurately replay 66.3% of the execution time. We use CERE codelets in a realistic study to evaluate three different architectures on the NAS benchmarks. CERE accurately estimates each architecture's performance and is 7.3× to 46.6× cheaper than running the full benchmark.
