Proceedings of the Tenth International Symposium on Code Generation and Optimization 2012
DOI: 10.1145/2259016.2259020
Dynamic compilation of data-parallel kernels for vector processors

Abstract: Modern processors enjoy augmented throughput and power efficiency through specialized functional units leveraged via instruction set extensions. These functional units accelerate performance for specific types of operations but must be programmed explicitly. Moreover, applications targeting these specialized units will not take advantage of future ISA extensions and tend not to be portable across multiple ISAs. As architecture designers increasingly rely on heterogeneity for performance improvements, the chall…

Cited by 11 publications (9 citation statements)
References 17 publications
“…We assess the efficiency of one particular approach to software-based compaction: the execution manager of Kerr et al. [2012]. At the current state of the art, this approach promises to most effectively eliminate control-flow divergence because, to the best of our knowledge, it is the approach that has the most freedom to rearrange threads.…”
Section: Control-flow Divergence Analysis
confidence: 99%
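The citation above concerns software-based thread compaction for eliminating control-flow divergence. The following is a hypothetical sketch of the general idea, not the execution manager from the paper: when threads in a fixed-size SIMD group ("warp") disagree on a branch, the group must serialize both paths; regrouping threads by branch direction before execution removes most of that divergence. `WARP_SIZE`, `divergent_warps`, and `compacted_warps` are all illustrative names invented here.

```python
# Illustrative sketch of thread compaction (not the paper's actual algorithm):
# reorder threads so that threads taking the same branch direction share a warp.

WARP_SIZE = 4

def warps(threads):
    """Split a list of thread ids into fixed-size SIMD groups."""
    return [threads[i:i + WARP_SIZE] for i in range(0, len(threads), WARP_SIZE)]

def divergent_warps(threads, predicate):
    """Without compaction: count warps whose threads disagree on the branch.
    Each such warp must execute both sides of the branch serially."""
    return sum(1 for w in warps(threads) if len({predicate(t) for t in w}) > 1)

def compacted_warps(threads, predicate):
    """With compaction: sort threads by branch direction first, so at most
    a single boundary warp can still mix both directions."""
    reordered = sorted(threads, key=predicate)
    return sum(1 for w in warps(reordered) if len({predicate(t) for t in w}) > 1)

# Example: an odd/even branch makes every warp diverge without compaction.
threads = list(range(16))
is_odd = lambda t: t % 2
print(divergent_warps(threads, is_odd))   # every warp mixes odd and even lanes
print(compacted_warps(threads, is_odd))   # after reordering, no warp diverges
```

The sketch only counts divergent warps; a real execution manager must also preserve per-thread state and memory-access behavior when it migrates threads between warps, which is why the "freedom to rearrange threads" mentioned in the citation is the key property.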
“…(3) The execution manager [Kerr et al. 2012] or its configuration could be an inappropriate choice for a baseline. Although unexpected (see the discussion in Section 2.1), other approaches (see Section 4) could exhibit better performance.…”
Section: Threats To Validity
confidence: 99%
“…Their technique finds parallelism at the level of work-item coalescing loops. Kerr et al. [15] propose a similar technique for CUDA kernels. We leave as our future work implementing auto-vectorization techniques in our framework and evaluating their performance on ARM processors.…”
Section: Related Work
confidence: 99%
“…The prior studies proposed methods to compile and execute applications written in OpenCL [10,12,18] or other accelerator programming models such as CUDA [11,15,24] on multicore CPUs. But they target x86 processors, while our work focuses on ARM processors for embedded systems.…”
Section: Introduction
confidence: 99%
“…Kerr et al. [14] implement a thread-invariant expression elimination pass, also based on [26]. The focus of their optimization pass is different than ours; they use common subexpression elimination on invariants after vectorization, whereas we allocate invariants to scalar register.…”
Section: Related Work
confidence: 99%
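The last citation contrasts two ways of handling thread-invariant expressions after vectorization. As a minimal sketch of the underlying idea, assume a kernel where some subexpression does not depend on the lane (thread) index; computing it once in a scalar and reusing it across all lanes is the effect both approaches aim for. The function names `kernel_naive` and `kernel_hoisted` are invented for illustration and do not come from either paper.

```python
# Hypothetical illustration of thread-invariant expression hoisting.

def kernel_naive(xs, a, b):
    # 'a * b' does not depend on the per-lane value x, yet it is
    # (conceptually) recomputed in every vector lane.
    return [x + a * b for x in xs]

def kernel_hoisted(xs, a, b):
    # The thread-invariant expression is computed once and kept scalar,
    # so the vectorized loop body only does the per-lane work.
    inv = a * b
    return [x + inv for x in xs]

# Both kernels compute the same result; the hoisted version models
# allocating the invariant to a scalar register.
print(kernel_hoisted([1, 2, 3], 4, 5))
```

In a real vectorizer the payoff is fewer vector instructions and lower register pressure in the SIMD loop body, which is why both cited passes target invariants specifically.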