Improving Performance of OpenCL on CPUs

Karrenberg, Ralf; Hack, Sebastian

doi:10.1007/978-3-642-28652-0_1

Cited by 47 publications

(27 citation statements)

References 15 publications

“…A similar analysis has been proposed for optimizing OpenCL kernels for CPU (rather than GPU) performance [24]. The analyses were developed independently.…”

Section: Engineering Issues For Efficient Verification (mentioning)

confidence: 99%

Engineering a Static Verification Tool for GPU Kernels

Bardsley

Betts

Chong

et al. 2014

Computer Aided Verification

Abstract. We report on practical experiences over the last 2.5 years related to the engineering of GPUVerify, a static verification tool for OpenCL and CUDA GPU kernels, plotting the progress of GPUVerify from a prototype to a fully functional and relatively efficient analysis tool. Our hope is that this experience report will serve the verification community by helping to inform future tooling efforts.

“…However, these approaches have limitations, including the exclusion of instructions with side-effects from poly-path execution [18]. Karrenberg and Hack [12,13] propose compiler algorithms to map OpenCL kernels down to packed-SIMD units with explicit vector blend instructions.…”

Section: A Vector Machines (mentioning)

confidence: 99%

“…Karrenberg and Hack [13] describe a similar static branch-uniformity optimization to reduce register pressure while vectorizing OpenCL kernels for packed-SIMD units in x86 processors. Reducing register pressure is especially important on an x86 processor, as it only has a limited number of scalar and vector registers.…”

Section: Linearizing the Control Flow (mentioning)

confidence: 99%

Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures

Lee

Grover

Krashinsky

et al. 2014

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture

Abstract. Data-parallel architectures must provide efficient support for complex control-flow constructs to support sophisticated applications coded in modern single-program multiple-data languages. As these architectures have wide datapaths that process a single instruction across parallel threads, a mechanism is needed to track and sequence threads as they traverse potentially divergent control paths through the program. The design space for divergence management ranges from software-only approaches where divergence is explicitly managed by the compiler, to hardware solutions where divergence is managed implicitly by the microarchitecture. In this paper, we explore this space and propose a new predication-based approach for handling control-flow structures in data-parallel architectures. Unlike prior predication algorithms, our new compiler analyses and hardware instructions consider the commonality of predication conditions across threads to improve efficiency. We prototype our algorithms in a production compiler and evaluate the tradeoffs between software and hardware divergence management on current GPU silicon. We show that our compiler algorithms make a predication-only architecture competitive in performance to one with hardware support for tracking divergence.

“…It is initially designed for GPGPU architectures [4,6]. And it can also be mapped to general purpose CPUs efficiently [23]. Recent studies attempt to use OpenCL for more diverging target architectures, such as FPGA [18] and ASIP [21].…”

Section: Related Work (mentioning)

confidence: 99%

A Co-Design Framework with OpenCL Support for Low-Energy Wide SIMD Processor

She

Waeijen

et al. 2014

J Sign Process Syst

Energy efficiency is one of the most important metrics in embedded processor design. The use of wide SIMD architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a design framework for a configurable wide SIMD architecture that utilizes an explicit datapath to achieve high energy efficiency. The framework is able to generate processor instances based on architecture specification files. It includes a compiler to efficiently program the proposed architecture with standard programming languages including OpenCL. This compiler can analyze the static memory access patterns in OpenCL kernels, generate efficient mappings, and schedule the code to fully utilize the explicit datapath. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve up to 200 times speed-up and reduce the total energy consumption by 50% compared to a basic RISC processor.
