Proceedings of the 22nd Annual International Conference on Supercomputing 2008
DOI: 10.1145/1375527.1375562

A compiler framework for optimization of affine loop nests for GPGPUs

Cited by 167 publications (113 citation statements)
References: 19 publications

“…Compared to a set of recent studies on performance autotuning by empirical search [11,12,13,14,15], we provide an alternative optimization solution. Certainly search-based approaches are a powerful tool for optimization, but we note two disadvantages of such an approach.…”
Section: Introduction (mentioning)
confidence: 99%
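
The excerpt above contrasts a model-driven approach with performance autotuning by empirical search. As a rough illustration of what "empirical search" means in this setting, the host-side sketch below times one stand-in CUDA kernel over a hand-picked set of thread-block sizes and keeps the fastest; the kernel (`scale`), the candidate list, and the timing loop are assumptions for illustration, not code from any of the cited studies.

```cuda
// Minimal sketch of empirical search over a single tuning parameter.
// Everything here (kernel, candidate block sizes, problem size) is illustrative.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 24;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    scale<<<256, 256>>>(d_x, 1.0f, n);   // warm-up so the first timed run is not penalized
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int candidates[] = {64, 128, 256, 512, 1024};  // the search space: block sizes to try
    int best_block = candidates[0];
    float best_ms = 1e30f;

    for (int block : candidates) {
        int grid = (n + block - 1) / block;
        cudaEventRecord(start);
        scale<<<grid, block>>>(d_x, 1.01f, n);     // empirically measure each variant
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_block = block; }
    }
    printf("best block size: %d (%.3f ms)\n", best_block, best_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```

The excerpt's point is that such a search measures real runs rather than reasoning from a model, which is powerful but has costs the authors go on to discuss.
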
“…Thus even though the GPU part of the Intel heterogeneous processor has higher single precision theoretical peak performance than its CPU part, the delivered SpMV throughput is lower than expected. For the CSR-vector method, the low performance has another reason: small thread-bunch of size 8 dramatically increases loop overhead [40], which is one of the well known bottlenecks [41] of GPU programming. In Figures 4 and 5, we can see that on the AMD heterogeneous processor, our method delivers up to 71.90x (94.05x) and on average 22.17x (22.88x) speedup over the single (double) precision CSR-scalar method running on the used GPU.…”
Section: Performance Analysis (mentioning)
confidence: 99%
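
For context on the loop-overhead point, here is a hedged sketch of a CSR-vector style SpMV kernel in which a compile-time `BUNCH` of threads cooperates on each row; it is not code from the excerpt's references [40] or [41]. With a small bunch such as 8, each thread makes more trips through the strided nonzero loop on long rows, so per-iteration overhead (index arithmetic, branching) grows relative to the useful multiply-adds.

```cuda
#define BUNCH 8  // assumed thread-bunch size; the excerpt discusses size 8

__global__ void spmv_csr_vector(int n_rows,
                                const int* __restrict__ row_ptr,
                                const int* __restrict__ col_idx,
                                const float* __restrict__ val,
                                const float* __restrict__ x,
                                float* __restrict__ y) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int row  = tid / BUNCH;              // one bunch of threads per matrix row
    int lane = tid % BUNCH;

    float sum = 0.0f;
    if (row < n_rows) {
        // Strided loop over the row's nonzeros; fewer lanes => more iterations per thread.
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += BUNCH)
            sum += val[j] * x[col_idx[j]];
    }

    // Tree reduction of the partial sums inside each bunch (BUNCH <= warp size;
    // assumes the block size is a multiple of 32 so the full-warp mask is valid).
    for (int off = BUNCH / 2; off > 0; off >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, off);

    if (row < n_rows && lane == 0) y[row] = sum;
}
```
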
“…Liu et al [25] looked at varying thread block dimensions and loop unrolling as optimizations in CUDA kernels. Baskaran et al [2] did an extensive study of loop unrolling within CUDA kernels, but they were concerned with improving the performance of a single application.…”
Section: Related Work (mentioning)
confidence: 99%
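
As a hedged illustration of the kind of unrolling knob such studies explore (not the actual kernels from [25] or [2]), the sketch below exposes the unroll factor as a template parameter of a stand-in CUDA kernel, so a tuner can instantiate and time several variants.

```cuda
// Illustrative only: a per-thread partial-sum kernel with a tunable unroll factor.
template <int UNROLL>
__global__ void block_sum(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i    = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per UNROLL-element chunk
    int base = i * UNROLL;

    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < UNROLL; ++k) {   // trip count known at compile time,
        int idx = base + k;              // so the compiler can fully unroll it,
        if (idx < n) acc += in[idx];     // trading code size and registers for fewer branches
    }
    if (base < n) out[i] = acc;          // per-thread partial sums; a second pass combines them
}

// An autotuner would instantiate and time several variants, e.g.:
//   block_sum<1><<<grid, block>>>(d_in, d_out, n);
//   block_sum<2><<<grid, block>>>(d_in, d_out, n);
//   block_sum<4><<<grid, block>>>(d_in, d_out, n);
```
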
“…In this work, we use this approach to obtain optimized versions of 2D and 3D convolution kernels, codes in the PolyBench [1] suite, and an implementation of belief propagation for stereo vision. The main contributions of this paper are: (1) showing that auto-tuning applied to GPU kernels written in a directive-based language can be used to effectively parallelize and optimize a variety of codes, in many cases meeting or exceeding the performance of hand-written GPU programs, (2) showing how particular transformations affect performance and describing the best transformation configuration found for each kernel, and (3) showing that the best transformations are kernel and architecture specific.…”
Section: Introduction (mentioning)
confidence: 99%
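
To make the "transformation configuration" idea concrete, here is a minimal sketch of a plain 3x3 2D convolution CUDA kernel whose thread-block (tile) shape is exposed as tunable parameters `BX` and `BY`; the kernel, the parameter names, and the 3x3 stencil size are illustrative assumptions, not the kernels actually tuned in the cited work.

```cuda
#define BX 32   // assumed tunable tile width
#define BY 8    // assumed tunable tile height

__global__ void conv3x3(const float* __restrict__ in, float* __restrict__ out,
                        const float* __restrict__ w,   // 3x3 stencil weights, row-major
                        int width, int height) {
    int x = blockIdx.x * BX + threadIdx.x;
    int y = blockIdx.y * BY + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;  // skip the border

    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            acc += w[(dy + 1) * 3 + (dx + 1)] * in[(y + dy) * width + (x + dx)];
    out[y * width + x] = acc;
}
// Launched with dim3 block(BX, BY) and a grid covering the image. A tuner in the
// spirit of the excerpt sweeps such parameters (tile shape, unrolling, ...) per kernel;
// that the best (BX, BY) differs by kernel and GPU is the excerpt's point (3).
```
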