Combining high productivity and high performance in image processing using Single Assignment C on multi-core CPUs and many-core GPUs

Wieser, Volkmar

doi:10.1117/1.jei.21.2.021116

Cited by 5 publications

(3 citation statements)

References 13 publications


“…We conclude that, for the running example of all-pairs N-body simulation, SAC and its tool chain do exhibit the desired combination of high software engineering productivity and high execution performance. These findings are in line with several previous application studies [26][27][28].…”

Section: Discussion (supporting)

confidence: 94%


SaC/C formulations of the all‐pairs N‐body problem and their performance on SMPs and GPGPUs

Šinkarovs, Scholz, Bernecky, et al. 2013

Concurrency and Computation


This paper describes our experience in implementing the classical N-body algorithm in SAC and analysing the runtime performance achieved on three different machines: a dual-processor 8-core Dell PowerEdge 2950 (a Beowulf cluster node, the reference machine), a quad-core hyper-threaded Intel Core-i7 based system equipped with an NVidia GTX-480 graphics accelerator and an Oracle Sparc T4-4 server with a total of 256 hardware threads. We contrast our findings with those resulting from the reference C code and a few variants of it that employ OpenMP pragmas as well as explicit vectorisation. Our experiments demonstrate that the SAC implementation successfully combines a high level of abstraction, very close to the mathematical specification, with very competitive runtimes. In fact, SAC matches or outperforms the hand-vectorised and hand-parallelised C codes on all three systems under investigation without the need for any source code modification. Furthermore, only SAC is able to effectively harness the advanced compute power of the graphics accelerator, again by mere recompilation of the same source code. Our results illustrate the benefits that SAC provides to application programmers in terms of coding productivity, source code, and performance portability among different machine architectures, as well as long-term maintainability in evolving hardware environments.


“…These findings are in line with several previous application studies [26][27][28].…”

Section: Discussion (supporting)

confidence: 93%

SaC/C formulations of the all‐pairs N‐body problem and their performance on SMPs and GPGPUs

Šinkarovs, Scholz, Bernecky, et al. 2013

Concurrency and Computation


“…There is a need for … One data locality approach in data parallel language compilers is to start from an imperative language with loops, and fuse the successive loops over the input image into an expression tree in a single loop, to improve cache locality and on-chip register locality, e.g. [6,12]. For CPU or GPU scheduling, this expression tree can be duplicated to apply the same fused computation on image chunks in a data parallel fashion.…”

Section: Eliminating Intermediate Buffers With Compiler Optimisation (mentioning)

confidence: 99%

A Dataflow IR for Memory Efficient RIPL Compilation to FPGAs

Stewart, Michaelson, Bhowmik, et al. 2016

Algorithms and Architectures for Parallel Processing


Abstract. Field programmable gate arrays (FPGAs) are fundamentally different to fixed processor architectures because their memory hierarchies can be tailored to the needs of an algorithm. FPGA compilers for high level languages are not hindered by fixed memory hierarchies. The constraint when compiling to FPGAs is the availability of resources. In this paper we describe how the dataflow intermediary of our declarative FPGA image processing DSL called RIPL enables us to constrain memory. We use five benchmarks to demonstrate that memory use with RIPL is comparable to the Vivado HLS OpenCV library without the need for language pragmas to guide hardware synthesis. The benchmarks also show that RIPL is more expressive than the Darkroom FPGA image processing language.


Single Assignment C (SAC)

Grelck

2019

Central European Functional Programming School
