A general approach for partitioning N-dimensional parallel nested loops with conditionals

Kejariwal, Arun; Nicolau, Alexandru; Saito, Hideki; Tian, Xin; Girkar, Milind; Banerjee, Utpal; Polychronopoulos, Constantine D.

doi:10.1145/1148109.1148117

Cited by 5 publications

(3 citation statements)

References 31 publications

(53 reference statements)


“…For example, improving the parallelization of kernels with diverging branches (parts executed only by a subset of the work-items) is one of the low-hanging fruits. There is some previous work available that is targeted towards enhanced load-balancing which could be adapted to improving the fine-grained parallelization on machines with limited support for predication as well [32].…”

Section: Discussion (mentioning)

confidence: 99%

pocl: A Performance-Portable OpenCL Implementation

Jääskeläinen

Lama

Schnetter

et al. 2014

Int J Parallel Prog

108


OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse due to multiple proprietary vendor implementations with different characteristics, and, thus, required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and on those that are still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications when compiled using pocl were faster or close to as fast as the best proprietary OpenCL implementation for the platform at hand. Comment: This article was published in 2015; it is now openly accessible via arxi


“…Unlike [8], the non-perfect nature of the loop model is not restricted to conditionals. The loop model also supports multi-way loops [14], i.e., multiple loops may be present at the same level.…”

Section: The Approach (mentioning)

confidence: 99%

“…A significant amount of work has been done in the context of static partitioning of (parallel) loop nests with rectangular as well as non-rectangular iteration spaces [4,5,6,7,8,9,10]. However, the existing techniques are cache-oblivious, i.e., they do not capture the variation in the number of cache misses across the iteration space.…”

Section: Introduction (mentioning)

confidence: 99%

Cache-aware iteration space partitioning

Kejariwal

Nicolau

Banerjee

et al. 2008

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Self Cite


The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails program parallelization and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance between the different cores which in turn has a direct impact on performance. In light of the fact that parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly) nested perfect/non-perfect parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware partitioning of iteration spaces of parallel loops. We present a case study using a kernel from the industry-standard SPEC CPU benchmark suite.
