A Sampling Based Strategy to Automatic Performance Tuning of GPU Programs

Feng, Wilson; Abdelrahman, Tarek S.

doi:10.1109/ipdpsw.2017.46

Cited by 7 publications

(5 citation statements)

References 14 publications


“…The authors show that their method can outperform the random search. Regression trees have been used to speed up autotuning in multiple studies [11,18,28]. The regression trees are built from a representative sample of the tuning space, and their precision can be improved during search [11].…”

Section: Hybrid Methods (mentioning)

confidence: 99%
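The idea in the statement above, a regression tree fitted to a sample of the tuning space and then used to rank untested configurations, can be sketched roughly as follows. This is a minimal illustration, not the method of any cited paper: the parameters `block_size` and `unroll`, the function `synthetic_runtime` (a made-up stand-in for a real kernel measurement), and the tree settings are all assumptions.

```python
import random

def synthetic_runtime(block_size, unroll):
    """Hypothetical stand-in for a measured kernel time (not real data)."""
    return abs(block_size - 256) / 64.0 + abs(unroll - 4) * 0.5 + 1.0

def sum_sq(points):
    """Sum of squared deviations of the runtimes from their mean."""
    values = [y for _, y in points]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_tree(points, depth=0, max_depth=4, min_leaf=2):
    """Fit a small regression tree on (config, runtime) pairs."""
    values = [y for _, y in points]
    mean = sum(values) / len(values)
    if depth >= max_depth or len(points) < 2 * min_leaf:
        return {"leaf": mean}
    best = None
    for dim in range(len(points[0][0])):
        # Try each observed value (except the smallest) as a split threshold.
        for threshold in sorted({x[dim] for x, _ in points})[1:]:
            left = [p for p in points if p[0][dim] < threshold]
            right = [p for p in points if p[0][dim] >= threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            sse = sum_sq(left) + sum_sq(right)
            if best is None or sse < best[0]:
                best = (sse, dim, threshold, left, right)
    if best is None:
        return {"leaf": mean}
    _, dim, threshold, left, right = best
    return {
        "dim": dim,
        "threshold": threshold,
        "left": fit_tree(left, depth + 1, max_depth, min_leaf),
        "right": fit_tree(right, depth + 1, max_depth, min_leaf),
    }

def predict(tree, config):
    """Walk the tree down to a leaf and return its mean runtime."""
    while "leaf" not in tree:
        side = "left" if config[tree["dim"]] < tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]

# Assumed two-parameter tuning space; only a sample of it is "measured".
space = [(b, u) for b in (32, 64, 128, 256, 512, 1024) for u in (1, 2, 4, 8)]
rng = random.Random(0)
sample = rng.sample(space, 12)
tree = fit_tree([(c, synthetic_runtime(*c)) for c in sample])

# Rank the whole space by predicted runtime; the top entries would be
# measured next, and their results could be used to refit the tree.
ranked = sorted(space, key=lambda c: predict(tree, c))
```

Refitting the tree as new measurements arrive is one plausible reading of "their precision can be improved during search"; the quoted statement does not say how the cited work does it.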

“…Regression trees have been used to speed up autotuning in multiple studies [11,18,28]. The regression trees are built from a representative sample of the tuning space, and their precision can be improved during search [11]. All the papers evaluate their approach using rather vast tuning spaces, e.g., testing all integer thread block sizes in the interval <1, 1024>, instead of using more rational sizes 2^n, n ∈ <5, 10>.…”

Section: Hybrid Methods (mentioning)

confidence: 99%
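To make the size difference in the statement above concrete, the two candidate sets for a single thread-block-size parameter can be enumerated directly. This is a small sketch using only the bounds quoted in the statement; the variable names are illustrative.

```python
# Every integer thread block size in <1, 1024>.
all_sizes = list(range(1, 1025))

# Only the powers of two 2^n for n in <5, 10>.
pow2_sizes = [2 ** n for n in range(5, 11)]

print(len(all_sizes))   # 1024
print(pow2_sizes)       # [32, 64, 128, 256, 512, 1024]
```

For this one parameter the restricted set has 6 candidates instead of 1024. Because a tuning space is the product of its parameters' value sets, the gap widens with every additional parameter.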

“…State-of-the-art methods for searching generic tuning spaces (i.e., including any tuning parameters) view their objective as a function of tuning parameters. They are based on mathematical optimization [6,39], or they use a surrogate performance/power model built from a sample of tuning space [18,28,10,11]. Because the function relating tuning parameters with the objective differs with hardware and input, those methods require the autotuning to be repeated from scratch when hardware or input changes.…”

Section: Introduction (mentioning)

confidence: 99%

“…The strength of our method is its ability to build a model using a particular GPU and input, and use this model to speed up autotuning of a kernel running on a different GPU or processing different input. This is possible because the method builds the model of relations between tuning parameters and performance counters and deduces the performance from performance counters, instead of relating tuning parameters directly to the performance, as is done in [6,39,18,28,10,11]. Compared to the performance itself, the tuning parameters affect performance counters in a more straightforward and stable way.…”

Section: Introduction (mentioning)

confidence: 99%


Using hardware performance counters to speed up autotuning convergence on GPUs

Filipovič, Hozzová, Nezarat et al. 2021

Preprint


Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimization of codes for a particular type of hardware and specific data characteristics can be extremely challenging. The autotuning of performance-relevant source-code parameters allows for automatic optimization of applications and keeps their performance portable. Although the autotuning process typically results in code speed-up, searching the tuning space can bring unacceptable overhead if (i) the tuning space is vast and full of poorly-performing implementations, or (ii) the autotuning process has to be repeated frequently because of changes in processed data or migration to different hardware. In this paper, we introduce a novel method for searching tuning spaces. The method takes advantage of collecting hardware performance counters (also known as profiling counters) during empirical tuning. Those counters are used to navigate the searching process towards faster implementations. The method requires the tuning space to be sampled on any GPU. It builds a problem-specific model, which can be used during autotuning on various, even previously unseen inputs or GPUs. Using a set of five benchmarks, we experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics. We also compared our method to the state of the art and show that our method is superior in terms of the number of searching steps and typically outperforms other searches in terms of convergence time.


“…Not observing such constraints can result in grossly increased auto-tuning times because of the many infeasible configurations that must be examined and compiled. While some attempts have been made to address such constraints in OpenTuner [32,33], it is necessary to modify OpenTuner to express the constraints that exist in our work. This makes it difficult to make a fair comparison to OpenTuner.…”

Section: Related Work (mentioning)

confidence: 99%

A Strategy for Automatic Performance Tuning of Stencil Computations on GPUs

Garvey, Abdelrahman 2018

Scientific Programming

Self Cite


We propose and evaluate a novel strategy for tuning the performance of a class of stencil computations on Graphics Processing Units. The strategy uses a machine learning model to predict the optimal way to load data from memory followed by a heuristic that divides other optimizations into groups and exhaustively explores one group at a time. We use a set of 104 synthetic OpenCL stencil benchmarks that are representative of many real stencil computations. We first demonstrate the need for auto-tuning by showing that the optimization space is sufficiently complex that simple approaches to determining a high-performing configuration fail. We then demonstrate the effectiveness of our approach on NVIDIA and AMD GPUs. Relative to a random sampling of the space, we find configurations that are 12%/32% faster on the NVIDIA/AMD platform in 71% and 4% less time, respectively. Relative to an expert search, we achieve 5% and 9% better performance on the two platforms in 89% and 76% less time. We also evaluate our strategy for different stencil computational intensities, varying array sizes and shapes, and in combination with expert search.
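The group-wise part of the strategy described in this abstract (divide the optimizations into groups and exhaustively explore one group at a time) can be sketched as below. This is only an illustration under stated assumptions: the parameter names, the grouping, and the `toy_cost` function are hypothetical, and the abstract's first step, a machine learning model that predicts how to load data from memory, is not modeled here.

```python
from itertools import product

def groupwise_search(groups, defaults, cost):
    """Exhaustively explore one parameter group at a time.

    groups   -- list of dicts mapping parameter name to candidate values
    defaults -- starting configuration covering every parameter
    cost     -- function returning the measured cost of a configuration
    """
    best = dict(defaults)
    best_cost = cost(best)
    evaluations = 1
    for group in groups:
        names = list(group)
        # Try every combination inside this group; parameters outside
        # the group keep the best values found so far.
        for combo in product(*(group[n] for n in names)):
            candidate = {**best, **dict(zip(names, combo))}
            c = cost(candidate)
            evaluations += 1
            if c < best_cost:
                best, best_cost = candidate, c
    return best, best_cost, evaluations

def toy_cost(cfg):
    """Hypothetical cost function standing in for a real kernel timing."""
    return (abs(cfg["tile_x"] - 16) + abs(cfg["tile_y"] - 2)
            + abs(cfg["unroll"] - 4) + (0 if cfg["use_local"] else 3))

groups = [
    {"tile_x": [8, 16, 32], "tile_y": [1, 2, 4]},
    {"unroll": [1, 2, 4], "use_local": [False, True]},
]
defaults = {"tile_x": 8, "tile_y": 1, "unroll": 1, "use_local": False}

best, best_cost, evaluations = groupwise_search(groups, defaults, toy_cost)
print(best, best_cost, evaluations)
```

In this toy setting the search costs 16 evaluations (1 for the defaults, 9 for the first group, 6 for the second) instead of the 54 needed to enumerate the full product space. The trade-off is that interactions between parameters in different groups are never explored jointly, so the result depends on how the groups are chosen.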
