Optimization space pruning without regrets

Beaugnon, Ulysse; Pouille, Antoine; Pouzet, Marc; Pienaar, J.; Cohen, Albert

doi:10.1145/3033019.3033023

Cited by 10 publications

(4 citation statements)

References 27 publications


“…However, the influence of invalid configurations is not discussed, as they are mostly focussed on x86 and GPU architectures, where this issue is much less dominant. Another approach is followed by TELAMON [4], which avoids invalid configurations by relying on a constraint-based, manually crafted hardware model. The model predicts the upper performance bound, while avoiding the construction of invalid configurations.…”

Section: Discussion and Related Work (mentioning)

confidence: 99%

HW-Aware Initialization of DNN Auto-Tuning to Improve Exploration Time and Robustness

Rieber, Reiber, Bringmann et al. 2022

Preprint


The process of optimizing the latency of DNN operators with ML models and hardware-in-the-loop, called auto-tuning, has established itself as a pervasive method for the deployment of neural networks. From a search space of loop optimizations, the candidate providing the best performance has to be selected. Performance of individual configurations is evaluated through hardware measurements. The combinatorial explosion of possible configurations, together with the cost of hardware evaluation, makes exhaustive exploration of the search space infeasible in practice. Machine learning methods, like random forests or reinforcement learning, are used to aid in the selection of candidates for hardware evaluation. For general-purpose hardware like x86 and GPGPU architectures, impressive performance gains can be achieved compared to hand-optimized libraries like cuDNN. The method is also useful in the space of hardware accelerators with less widespread adoption, where a high-performance library is not always available. However, hardware accelerators are often less flexible with respect to their programming, which leads to operator configurations that are not executable on the hardware target. This work evaluates how these invalid configurations affect the auto-tuning process and its underlying performance prediction model for the VTA hardware. From these results, a validity-driven initialization method for AutoTVM is developed, requiring only 41.6% of the necessary hardware measurements to find the best solution, while improving search robustness.



“…Coloured Petri nets [20] were proposed for GPGPU performance modelling. Another approach [3] builds an analytical performance model to determine the lower bound on execution time. Low-level GPU ISA solving and assembly microbenchmarking [38] has been used to collect data about architectural features and performance.…”

Section: Related Work (mentioning)

confidence: 99%

High-level hardware feature extraction for GPU performance prediction of stencils

Remmelg, Hagedorn et al. 2020

Proceedings of the 13th Annual Workshop on General Purpose Processing Using Graphics Processing Unit


High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as Lift, Futhark or Delite have shown that it is possible to have both high-level abstractions and performance, even for HPC workloads such as stencils. In addition, these high-level functional abstractions can also be used to represent programs and their optimized variants within the compiler itself. However, such high-level approaches rely heavily on the compiler to optimize programs, which is notoriously hard when targeting GPUs. Compilers either use hand-crafted heuristics to direct the optimizations or iterative compilation to search the optimization space. The first approach has fast compile times; however, it is not performance-portable across different devices and requires a lot of human effort to build the heuristics. Iterative compilation, on the other hand, has the ability to search the optimization space automatically and adapts to different devices. However, this process is often very time-consuming, as thousands of variants have to be evaluated. Performance models based on statistical techniques have been proposed to speed up the optimization space exploration. However, they rely on low-level hardware features, in the form of performance counters or low-level static code features. Using the Lift framework, this paper demonstrates how low-level, GPU-specific features are extractable directly from a high-level functional representation. The Lift IR (Intermediate Representation) is in fact a very suitable choice since all optimization choices are exposed at the IR level. This paper shows how to extract low-level features such as the number of unique cache lines accessed per warp, which is crucial for building accurate GPU performance models. Using this approach, we are able to speed up the exploration of the space by a factor of 2000x on an AMD GPU and 450x on Nvidia on average across many stencil applications.


“…We complement the TAG algorithm with a performance model of the candidates [2]. The model provides a lower bound on the execution time of all implementations derivable from a candidate.…”

Section: Search Strategy (mentioning)

confidence: 99%

“…The lower bound performance model mentioned in Section 5 could not work if it just had access to an intermediate implementation in the compilation process. A similar performance model relying on ad-hoc partial implementations was previously introduced [2]. We generalize the idea by encoding partial implementations as a CSP problem on top of a semantic backbone.…”

Section: Global Heuristics (mentioning)

confidence: 99%

On the Representation of Partially Specified Implementations and its Application to the Optimization of Linear Algebra Kernels on GPU

Beaugnon, Clément, Tollenaere et al. 2019

Preprint

Self Cite


Traditional optimizing compilers rely on rewrite rules to iteratively apply program transformations. This iterative approach hides optimization opportunities behind intermediate transformation steps. For instance, vectorization can only be applied to the innermost loop in a nest: one must first perform a loop interchange before even considering vectorization of an outer loop. In contrast, we propose an implementation framework representing programs as sets of possible implementation decisions. Specifying one decision can have an impact on others in a bidirectional manner: specifying that a loop must be vectorized prevents other loops from being nested inside it; conversely, specifying a loop as an outer loop will prevent it from being vectorized. These optimization decisions commute, obviating the pass ordering problem. We present a constraint programming system to formally define, represent and explore such implementation spaces. We also propose an exploration strategy combining tree search and branch-and-bound; the strength and novelty of this strategy reside in an analytical model of the lower bound on the execution time of a set of possible implementations. We showcase our approach on the construction and exploration of an implementation space for linear algebra kernels running on GPUs. We show this search space is expressive enough to represent complex decisions that fundamentally change the structure of the generated code. We also present preliminary results competitive with the performance of native GPU libraries.
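The exploration strategy described in this abstract (tree search combined with branch-and-bound, guided by a lower bound on execution time) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the decision space `SPACE`, the cost numbers, and the functions `lower_bound`, `evaluate`, and `search` are hypothetical and are not taken from the paper or from any real tool. The only property it relies on is that the bound for a partial implementation never exceeds the cost of any implementation derivable from it, so pruning cannot discard a better solution.

```python
from typing import Dict, Optional, Tuple

# Hypothetical decision space: each choice maps its options to a toy cost.
SPACE: Dict[str, Dict[str, float]] = {
    "tile": {"1": 8.0, "4": 3.0, "16": 2.0},
    "unroll": {"off": 4.0, "on": 1.5},
    "vectorize": {"off": 5.0, "on": 1.0},
}


def lower_bound(partial: Dict[str, str]) -> float:
    """Cost of the fixed decisions plus the cheapest option of each open one."""
    total = 0.0
    for choice, options in SPACE.items():
        if choice in partial:
            total += options[partial[choice]]
        else:
            total += min(options.values())
    return total


def evaluate(full: Dict[str, str]) -> float:
    """Toy stand-in for a measurement: additive cost plus a penalty >= 0."""
    clash = full["tile"] == "16" and full["vectorize"] == "on"
    return lower_bound(full) + (2.5 if clash else 0.0)


def search() -> Tuple[float, Optional[Dict[str, str]], int]:
    """Depth-first branch-and-bound; returns (best time, config, evaluations)."""
    order = list(SPACE)
    best_time = float("inf")
    best_config: Optional[Dict[str, str]] = None
    evaluated = 0

    def visit(partial: Dict[str, str]) -> None:
        nonlocal best_time, best_config, evaluated
        # Prune: nothing derivable from this candidate can beat the incumbent.
        if lower_bound(partial) >= best_time:
            return
        if len(partial) == len(order):
            evaluated += 1
            time = evaluate(partial)
            if time < best_time:
                best_time, best_config = time, dict(partial)
            return
        choice = order[len(partial)]
        # Try the locally cheapest options first to find a good incumbent early.
        for option in sorted(SPACE[choice], key=SPACE[choice].get):
            visit({**partial, choice: option})

    visit({})
    return best_time, best_config, evaluated


if __name__ == "__main__":
    print(search())
```

In this toy space, the search evaluates only a small fraction of the 12 complete configurations because whole subtrees are rejected by the bound alone. A real system would replace `evaluate` with hardware measurements and `lower_bound` with an analytical performance model.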

