Proceedings of the 22nd Annual International Conference on Supercomputing 2008
DOI: 10.1145/1375527.1375562

A compiler framework for optimization of affine loop nests for GPGPUs

Cited by 167 publications (113 citation statements)
References: 19 publications

“…Compared to a set of recent studies on performance autotuning by empirical search [11,12,13,14,15], we provide an alternative optimization solution. Certainly search-based approaches are a powerful tool for optimization, but we note two disadvantages of such an approach.…”
Section: Introduction (mentioning)
confidence: 99%
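
The excerpt above contrasts a model-driven approach with performance autotuning by empirical search. As a rough illustration of what "empirical search" means in this setting, the host-side sketch below times one stand-in CUDA kernel over a hand-picked set of thread-block sizes and keeps the fastest; the kernel (`scale`), the candidate list, and the timing loop are assumptions for illustration, not code from any of the cited studies.

```cuda
// Minimal sketch of empirical search over a single tuning parameter.
// Everything here (kernel, candidate block sizes, problem size) is illustrative.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 24;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    scale<<<256, 256>>>(d_x, 1.0f, n);   // warm-up so the first timed run is not penalized
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int candidates[] = {64, 128, 256, 512, 1024};  // the search space: block sizes to try
    int best_block = candidates[0];
    float best_ms = 1e30f;

    for (int block : candidates) {
        int grid = (n + block - 1) / block;
        cudaEventRecord(start);
        scale<<<grid, block>>>(d_x, 1.01f, n);     // empirically measure each variant
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_block = block; }
    }
    printf("best block size: %d (%.3f ms)\n", best_block, best_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```

The excerpt's point is that such a search measures real runs rather than reasoning from a model, which is powerful but has costs the authors go on to discuss.
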
“…Thus even though the GPU part of the Intel heterogeneous processor has higher single precision theoretical peak performance than its CPU part, the delivered SpMV throughput is lower than expected. For the CSR-vector method, the low performance has another reason: small thread-bunch of size 8 dramatically increases loop overhead [40], which is one of the well known bottlenecks [41] of GPU programming. In Figures 4 and 5, we can see that on the AMD heterogeneous processor, our method delivers up to 71.90x (94.05x) and on average 22.17x (22.88x) speedup over the single (double) precision CSR-scalar method running on the used GPU.…”
Section: Performance Analysis (mentioning)
confidence: 99%
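
For context on the loop-overhead point, here is a hedged sketch of a CSR-vector style SpMV kernel in which a compile-time `BUNCH` of threads cooperates on each row; it is not code from the excerpt's references [40] or [41]. With a small bunch such as 8, each thread makes more trips through the strided nonzero loop on long rows, so per-iteration overhead (index arithmetic, branching) grows relative to the useful multiply-adds.

```cuda
#define BUNCH 8  // assumed thread-bunch size; the excerpt discusses size 8

__global__ void spmv_csr_vector(int n_rows,
                                const int* __restrict__ row_ptr,
                                const int* __restrict__ col_idx,
                                const float* __restrict__ val,
                                const float* __restrict__ x,
                                float* __restrict__ y) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int row  = tid / BUNCH;              // one bunch of threads per matrix row
    int lane = tid % BUNCH;

    float sum = 0.0f;
    if (row < n_rows) {
        // Strided loop over the row's nonzeros; fewer lanes => more iterations per thread.
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += BUNCH)
            sum += val[j] * x[col_idx[j]];
    }

    // Tree reduction of the partial sums inside each bunch (BUNCH <= warp size;
    // assumes the block size is a multiple of 32 so the full-warp mask is valid).
    for (int off = BUNCH / 2; off > 0; off >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, off);

    if (row < n_rows && lane == 0) y[row] = sum;
}
```
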
“…Liu et al [25] looked at varying thread block dimensions and loop unrolling as optimizations in CUDA kernels. Baskaran et al [2] did an extensive study of loop unrolling within CUDA kernels, but they were concerned with improving the performance of a single application.…”
Section: Related Work (mentioning)
confidence: 99%
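
As a hedged illustration of the kind of unrolling knob such studies explore (not the actual kernels from [25] or [2]), the sketch below exposes the unroll factor as a template parameter of a stand-in CUDA kernel, so a tuner can instantiate and time several variants.

```cuda
// Illustrative only: a per-thread partial-sum kernel with a tunable unroll factor.
template <int UNROLL>
__global__ void block_sum(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i    = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per UNROLL-element chunk
    int base = i * UNROLL;

    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < UNROLL; ++k) {   // trip count known at compile time,
        int idx = base + k;              // so the compiler can fully unroll it,
        if (idx < n) acc += in[idx];     // trading code size and registers for fewer branches
    }
    if (base < n) out[i] = acc;          // per-thread partial sums; a second pass combines them
}

// An autotuner would instantiate and time several variants, e.g.:
//   block_sum<1><<<grid, block>>>(d_in, d_out, n);
//   block_sum<2><<<grid, block>>>(d_in, d_out, n);
//   block_sum<4><<<grid, block>>>(d_in, d_out, n);
```
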
“…In this work, we use this approach to obtain optimized versions of 2D and 3D convolution kernels, codes in the PolyBench [1] suite, and an implementation of belief propagation for stereo vision. The main contributions of this paper are: (1) showing that auto-tuning applied to GPU kernels written in a directive-based language can be used to effectively parallelize and optimize a variety of codes, in many cases meeting or exceeding the performance of hand-written GPU programs, (2) showing how particular transformations affect performance and describing the best transformation configuration found for each kernel, and (3) showing that the best transformations are kernel and architecture specific.…”
Section: Introduction (mentioning)
confidence: 99%
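
To make the "transformation configuration" idea concrete, here is a minimal sketch of a plain 3x3 2D convolution CUDA kernel whose thread-block (tile) shape is exposed as tunable parameters `BX` and `BY`; the kernel, the parameter names, and the 3x3 stencil size are illustrative assumptions, not the kernels actually tuned in the cited work.

```cuda
#define BX 32   // assumed tunable tile width
#define BY 8    // assumed tunable tile height

__global__ void conv3x3(const float* __restrict__ in, float* __restrict__ out,
                        const float* __restrict__ w,   // 3x3 stencil weights, row-major
                        int width, int height) {
    int x = blockIdx.x * BX + threadIdx.x;
    int y = blockIdx.y * BY + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;  // skip the border

    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            acc += w[(dy + 1) * 3 + (dx + 1)] * in[(y + dy) * width + (x + dx)];
    out[y * width + x] = acc;
}
// Launched with dim3 block(BX, BY) and a grid covering the image. A tuner in the
// spirit of the excerpt sweeps such parameters (tile shape, unrolling, ...) per kernel;
// that the best (BX, BY) differs by kernel and GPU is the excerpt's point (3).
```
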