Optimising purely functional GPU programs

McDonell, Trevor L.; Chakravarty, Manuel M. T.; Keller, Gabriele; Lippmeier, Ben

doi:10.1145/2500365.2500595

Cited by 72 publications

(28 citation statements)

References 99 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…As in Accelerate [9,30], we deliberately restrict ourselves to a set of primitives for which we know that high performance CPU and GPU implementations exist. In contrast to Accelerate, we allow nesting of primitives to express nested parallelism.…”

Section: Algorithmic Primitivesmentioning

confidence: 99%

“…When generating code, these rules in effect allow us to fuse the implementation of different functions and avoid having to store temporary results. The functional programming community has studied more sophisticated and generic rules for fusion [13,26,30]. However, for our current restricted set of benchmarks our simpler fusion rules have proven to be sufficient.…”

Section: Reduce Rulesmentioning

confidence: 99%

See 1 more Smart Citation

Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code

SteuwerMichel¹,

FenschChristian²,

LindleySam³

et al. 2015

SIGPLAN Not.

View full text Add to dashboard Cite

Computers have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort resulting in a tension between performance and code portability. Typically, code is either tuned in a low-level imperative language using hardware-specific optimizations to achieve maximum performance or is written in a high-level, possibly functional, language to achieve portability at the expense of performance.We propose a novel approach aiming to combine high-level programming, code portability, and high-performance. Starting from a high-level functional expression we apply a simple set of rewrite rules to transform it into a low-level functional representation, close to the OpenCL programming model, from which OpenCL code is generated. Our rewrite rules define a space of possible implementations which we automatically explore to generate hardware-specific OpenCL implementations. We formalize our system with a core dependently-typed λ-calculus along with a denotational semantics which we use to prove the correctness of the rewrite rules.We test our design in practice by implementing a compiler which generates high performance imperative OpenCL code. Our experiments show that we can automatically derive hardwarespecific implementations from simple functional high-level algorithmic expressions offering performance on a par with highly tuned code for multicore CPUs and GPUs written by experts.

show abstract

Section: Algorithmic Primitivesmentioning

confidence: 99%

Section: Reduce Rulesmentioning

confidence: 99%

Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code

SteuwerMichel¹,

FenschChristian²,

LindleySam³

et al. 2015

SIGPLAN Not.

View full text Add to dashboard Cite

show abstract

“…This process involves mapping of parallelism, performing optimizations such as fusion of operations and finally code generation. This approach is used by a many systems such as Copperhead [5], Delite [3], Accelerate [6,15], LiquidMetal [9], HiDP [21], Halide [16] and NOVA [8].…”

Section: Related Work and Motivationmentioning

confidence: 99%

“…High-level languages such as Lift [18], Accelerate [15], Delite [19], StreamIt [20] or Halide [16] have been proposed to ease programming of GPUs. These approaches are all based on parallel patterns, a concept developed in the late 80's [7].…”

Section: Introductionmentioning

confidence: 99%

LIFT: A functional data-parallel IR for high-performance GPU code generation

Steuwer

Remmelg

Dubach

2017

2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

105

113

View full text Add to dashboard Cite

Parallel patterns (e.g., map, reduce) have gained traction as an abstraction for targeting parallel accelerators and are a promising answer to the performance portability problem. However, compiling high-level programs into efficient lowlevel parallel code is challenging. Current approaches start from a high-level parallel IR and proceed to emit GPU code directly in one big step. Fixed strategies are used to optimize and map parallelism exploiting properties of a particular GPU generation leading to performance portability issues.We introduce the Lift IR, a new data-parallel IR which encodes OpenCL-specific constructs as functional patterns. Our prior work has shown that this functional nature simplifies the exploration of optimizations and mapping of parallelism from portable high-level programs using rewrite-rules. This paper describes how Lift IR programs are compiled into efficient OpenCL code. This is non-trivial as many performance sensitive details such as memory allocation, array accesses or synchronization are not explicitly represented in the Lift IR. We present techniques which overcome this challenge by exploiting the pattern's high-level semantics. Our evaluation shows that the Lift IR is flexible enough to express GPU programs with complex optimizations achieving performance on par with manually optimized code.

show abstract

“…Data parallel patterns are implemented as fixed OpenCL kernels lacking portability. Accelerate [13] is a domain specific language embedded in Hasekll for GPU programming. The implementation relies on templates of manually written CUDA kernels.…”

Section: High-level Gpu Programming Approachesmentioning

confidence: 99%

Performance portable GPU code generation for matrix multiplication

Remmelg

Lutz

Steuwer

et al. 2016

Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit

View full text Add to dashboard Cite

Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to attempt to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance resulting in non-portable solutions that need to be re-optimized for every new device.Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem even for well studied applications like matrix multiplication. We argue that what is needed is a way to describe applications at a high-level without committing to particular implementations. To this end, we developed in a previous paper a functional data-parallel language which allows applications to be expressed in a device neutral way. We use a set of well-defined rewrite rules to automatically transform programs into semantically equivalent devicespecific forms, from which OpenCL code is generated.In this paper, we demonstrate how this approach produces high-performance OpenCL code for GPUs with a wellstudied, well-understood application: matrix multiplication. Starting from a single high-level program, our compiler automatically generate highly optimized and specialized implementations. We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization like tiling and register blocking in a composable way. Using an exploration strategy our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized -but provably correct -implementation of matrix multiplication. The automatically generated code offers competitive performance compared to the manually tuned MAGMA library implementations of matrix multiplication on Nvidia and even outperforms AMD's clBLAS library.

show abstract

Optimising purely functional GPU programs

Cited by 72 publications

References 99 publications

Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code

Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code

LIFT: A functional data-parallel IR for high-performance GPU code generation

Performance portable GPU code generation for matrix multiplication

Contact Info

Product

Resources

About