Developing High-Performance, Portable OpenCL Code via Multi-Dimensional Homomorphisms

Rasch, Ari; Schulze, Richard; Gorlatch, Sergei

doi:10.1145/3318170.3318171

Cited by 4 publications

(5 citation statements)

References 0 publications


“…This group includes different unique solutions. [38] focuses on achieving cross-platform high performance. The goal is to have a single representation of an algorithm that can be translated into an efficient implementation on various computing systems.…”

Section: Existing Solutions (mentioning)

confidence: 99%

“…The means of achievement is the so-called "multidimensional homomorphisms", a formal description of a problem at a high level that allows expressing computations with parallel patterns that can be implemented in different ways on different hardware. A significant limitation of [38] is the use of OpenCL as a low-level layer which does not provide access to many of the new capabilities of modern GPUs (including mobile ones) due to the limitations of the OpenCL. Next, [38] uses extremely limited DSL, which is a significant drawback.…”

Section: Existing Solutions (mentioning)

confidence: 99%


“…There are no technologies which can achieve 2 goals simultaneously: (1) cross-platform ability and (2) accessing specific HW features, because existing solutions don't have an intermediate layer between high-level algorithm description and its actual implementation. Best results in this direction have been achieved in [43,44,38,39].…”

Section: Conclusion On Existing Solutions (mentioning)

confidence: 99%


An Auto-Programming Approach to Vulkan

Frolov, Sanzharov, Galaktionov et al. 2021

Proceedings of the 31th International Conference on Computer Graphics and Vision. Volume 2


We propose a novel high-level approach for software development on GPU using the Vulkan API. Our goal is to speed up development and performance studies for complex algorithms on GPU, which is quite difficult and laborious in Vulkan due to the large number of hardware features and low-level details. The proposed approach uses auto-programming to translate ordinary C++ into an optimized Vulkan implementation with automatic shader generation, resource binding, and fine-grained barrier placement. Our model is not general-purpose programming, but it is extensible and customer-focused. For a single C++ input, our tool can generate multiple different implementations of an algorithm in Vulkan for different cases or types of hardware. For example, we automatically detect a reduction in C++ source code and then generate several variants of parallel reduction on GPU: optimized for different warp sizes, with or without atomics, and with or without subgroup operations. Another example is GPU ray tracing applications, for which we can generate different variants: a pure software implementation in a compute shader, one using hardware-accelerated ray queries, and one using the full RTX pipeline. The goal of our work is to increase the productivity of developers who are forced to use Vulkan because their software requires various hardware features, but who still care about the cross-platform ability of the developed software and want to debug their algorithm logic on the CPU. Therefore, we assume that the user will take the generated code and integrate it with hand-written Vulkan code.


“…Auto-Tuning and Program Synthesis Auto-Tuning approaches including Halide's auto-tuners [1,16], OpenTuner [2], ATF [17] and MD-Hom [18], and program synthesis techniques such as SwizzleInventor [14] aim to automatically develop optimized implementations by navigating a search space of possible implementations. We see potential for a similar automatic search space exploration for Fireiron's decompositions, however as of today, Fireiron is designed as a tool for performance experts, simplifying the development of optimizations rather than automatically searching for highly optimized implementations.…”

Section: Related Work (mentioning)

confidence: 99%

Fireiron: A Scheduling Language for High-Performance Linear Algebra on GPUs

Hagedorn, Elliott, Barthels et al. 2020

Preprint


Achieving high-performance GPU kernels requires optimizing algorithm implementations to the targeted GPU architecture. It is of utmost importance to fully use the compute and memory hierarchy, as well as available specialised hardware. Currently, vendor libraries like cuBLAS and cuDNN provide the best performing implementations of GPU algorithms. However, the task of the library programmer is incredibly challenging: for each provided algorithm, high-performance implementations have to be developed for all commonly used architectures, input sizes, and different storage formats. These implementations are generally provided as optimized assembly code because performance-critical architectural features are only exposed at this level. This prevents reuse between different implementations of even the same algorithm, as simple differences can have major effects on low-level implementation details. In this paper we introduce Fireiron, a DSL and compiler which allows the specification of high-performance GPU implementations as compositions of simple and reusable building blocks. We show how to use Fireiron to optimize matrix multiplication implementations, achieving performance matching hand-coded CUDA kernels, even when using specialised hardware such as NVIDIA Tensor Cores, and outperforming state-of-the-art implementations provided by cuBLAS by more than 2×.
