Parametric GPU Code Generation for Affine Loop Programs

Konstantinidis, Athanasios; Ramanujam, J.; Sadayappan, P.

doi:10.1007/978-3-319-09967-5_8

Cited by 5 publications

(2 citation statements)

References 22 publications


“…Several works investigate the generation of loop code for a number of processors unknown at compile time. Dyn-Tile [10] and D-Tiling [12] target general-purpose multi-cores; Kong et al. [14] generate vectorized code for cores supporting SIMD processing; Konstantinidis et al. [15] generate parallelized code for GPUs. However, none of these approaches apply to TCPAs because the target architectures neither rely on cycle-accurate synchronization of components nor require PE-specific compact programs (see Section 6.1) to save space and keep instruction memories small.…”

Section: Other Symbolic Loop Compilation Approaches (mentioning)

confidence: 99%

Symbolic Loop Compilation for Tightly Coupled Processor Arrays

Witterauf, Walter, Hannig et al. 2021

Preprint


Loop compilation for Tightly Coupled Processor Arrays (TCPAs), a class of massively parallel loop accelerators, entails solving NP-hard problems, yet depends on the loop bounds and number of available processing elements (PEs), parameters known only at runtime because of dynamic resource management and input sizes. Therefore, this article proposes a two-phase approach called symbolic loop compilation: At compile time, the necessary NP-complete problems are solved and the solutions compiled into a space-efficient symbolic configuration. At runtime, a concrete configuration is generated from the symbolic configuration according to the parameter values. We show that the latter phase, called instantiation, runs in polynomial time with its most complex step, program instantiation, not depending on the number of PEs. As validation, we performed symbolic loop compilation on real-world loops and measured time and space requirements. Our experiments confirm that a symbolic configuration is space-efficient and suited for systems with little memory (often, a symbolic configuration is smaller than a single concrete configuration) and that program instantiation scales well with the number of PEs: for example, when instantiating a symbolic configuration of a matrix-matrix multiplication, the execution time is similar for 4 × 4 and 32 × 32 PEs. CCS Concepts: • Computer systems organization → Systolic arrays; Embedded and cyber-physical systems; • Software and its engineering → Compilers.



“…Most studies focus on identifying data reuse (e.g., using a polyhedral model) [8]-[12] and exploit it by enabling local memory. Alternatively, in [13], the authors present a fully automated C-to-FPGA framework, including an end-to-end solution for on-chip buffer optimization that automatically detects and implements the available data reuse in a loop nest.…”

Section: Related Work (mentioning)

confidence: 99%

Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels

Fang, Sips, Jääskeläinen et al. 2014

2014 43rd International Conference on Parallel Processing


Abstract: Due to the diversity of processor architectures and application memory access patterns, the performance impact of using local memory in OpenCL kernels has become unpredictable. For example, enabling the use of local memory for an OpenCL kernel can be beneficial for the execution on a GPU, but can lead to performance losses when running on a CPU. To address this unpredictability, we propose an empirical approach: by disabling the use of local memory in OpenCL kernels, we enable users to compare the kernel versions with and without local memory, and further choose the best performing version for a given platform. To this end, we have designed Grover, a method to automatically remove local memory usage from OpenCL kernels. In particular, we create a correspondence between the global and local memory spaces, which is used to replace local memory accesses by global memory accesses. We have implemented this scheme in the LLVM framework as a compiler pass, which automatically transforms an OpenCL kernel with local memory to a version without it. We have validated Grover with 11 applications, and found that it can successfully disable local memory usage for all of them. We have compared the kernels with and without local memory on three different processors, and found performance improvements for more than a third of the test cases after Grover disabled local memory usage. We conclude that such a compiler pass can be beneficial for performance, and, because it is fully automated, it can be used as an auto-tuning step for OpenCL kernels.


Alpinist: An Annotation-Aware GPU Program Optimizer

Şakar, Safari, Huisman et al. 2022

Lecture Notes in Computer Science


GPU programs are widely used in industry. To obtain the best performance, a typical development process involves the manual or semi-automatic application of optimizations prior to compiling the code. To avoid the introduction of errors, we can augment GPU programs with (pre- and postcondition-style) annotations to capture functional properties. However, keeping these annotations correct when optimizing GPU programs is labor-intensive and error-prone. This paper introduces Alpinist, an annotation-aware GPU program optimizer. It applies frequently used GPU optimizations, but besides transforming code, it also transforms the annotations. We evaluate Alpinist, in combination with the VerCors program verifier, to automatically optimize a collection of verified programs and reverify them.

