On the Representation of Partially Specified Implementations and its Application to the Optimization of Linear Algebra Kernels on GPU

Beaugnon, Ulysse; Clément, Basile; Tollenaere, Nicolas; Cohen, Albert

doi:10.48550/arxiv.1904.03383

Cited by 3 publications

(3 citation statements)

References 26 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The partially specified implementations IR [3] expresses programs using scalar arithmetic operators and iteration dimensions. The compiler optimizes the implementation through constraint satisfaction.…”

Section: Related Workmentioning

confidence: 99%

Mapping parallelism in a functional IR through constraint satisfaction: a case study on convolution for mobile GPUs

Mogers

Radu

et al. 2022

Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction

View full text Add to dashboard Cite

Graphics Processing Units (GPUs) are notoriously hard to optimize for manually. What is needed are good automatic code generators and optimizers. Accelerate, Futhark and Lift have demonstrated that using a functional approach is well suited to solve this challenge. Lift, for instance, uses a system of rewrite rules with a multi-stage approach. Algorithmic optimizations are first explored, followed by hardware-specific optimizations such as using shared memory and mapping parallelism.While the algorithmic exploration leads to correct transformed programs by construction, it is not necessarily true for the latter phase. Exploiting shared memory and mapping parallelism while ensuring correct synchronization is a delicate balancing act, and is hard to encode in a rewrite system. Currently, Lift relies on heuristics with ad-hoc mechanisms to check for correctness.This paper proposes to extract parallelization constraints automatically from a functional IR and use a solver to identify valid rewriting. Using a convolutional neural network on a mobile GPU as a use case, this approach matches the performance of the ARM Compute Library GEMM convolution and the TVM-generated kernel consuming between 2× and 3.6× less memory. Furthermore, a speedup of 12× is achieved over the ARM Compute Library direct convolution implementation.

show abstract

Section: Related Workmentioning

confidence: 99%

Mapping parallelism in a functional IR through constraint satisfaction: a case study on convolution for mobile GPUs

Mogers

Radu

et al. 2022

Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction

View full text Add to dashboard Cite

show abstract

“…31,32 The loop autotuner in Telamon also uses Monte-Carlo Tree Search. 52 In the tradition of Halide every level needs assigned a strategy and a schedule where not all loops have an assigned strategy is considered incomplete. Whereas in our approach, every loop is considered sequential until we add a pragma.…”

Section: Related Workmentioning

confidence: 99%

Autotuning PolyBench benchmarks with LLVM Clang/Polly loop optimization pragmas using Bayesian optimization

Kruse

Balaprakash

et al. 2021

Concurrency and Computation

View full text Add to dashboard Cite

We develop a ytopt autotuning framework that leverages Bayesian optimization to explore the parameter space search and compare four different supervised learning methods within Bayesian optimization and evaluate their effectiveness. We select six of the most complex PolyBench benchmarks and apply the newly developed LLVM Clang/Polly loop optimization pragmas to the benchmarks to optimize them. We then use the autotuning framework to optimize the pragma parameters to improve their performance. The experimental results show that our autotuning approach outperforms the other compiling methods to provide the smallest execution time for the benchmarks syr2k, 3mm, heat‐3d, lu, and covariance with two large datasets in 200 code evaluations for effectively searching the parameter spaces with up to 170,368 different configurations. We find that the Floyd–Warshall benchmark did not benefit from autotuning. To cope with this issue, we provide some compiler option solutions to improve the performance. Then we present loop autotuning without a user's knowledge using a simple mctree autotuning framework to further improve the performance of the Floyd–Warshall benchmark. We also extend the ytopt autotuning framework to tune a deep learning application.

show abstract

“…We also demonstrated the viability of the autotuning search space for loop transformations that has the straightforward representation as either a tree or a directed acyclic graph using mctree 31,32 . The loop autotuner in Telamon also uses Monte-Carlo Tree Search 52 . In the tradition of Halide every level needs assigned a strategy and a schedule where not all loops have an assigned strategy is considered incomplete.…”

Section: Related Workmentioning

confidence: 99%

Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization Pragmas Using Bayesian Optimization (extended version)

Wu¹,

Kruse²,

Balaprakash³

et al. 2021

Preprint

View full text Add to dashboard Cite

In this paper, we develop a ytopt autotuning framework that leverages Bayesian optimization to explore the parameter space search and compare four different supervised learning methods within Bayesian optimization and evaluate their effectiveness. We select six of the most complex PolyBench benchmarks and apply the newly developed LLVM Clang/Polly loop optimization pragmas to the benchmarks to optimize them.We then use the autotuning framework to optimize the pragma parameters to improve their performance. The experimental results show that our autotuning approach outperforms the other compiling methods to provide the smallest execution time for the benchmarks syr2k, 3mm, heat-3d, lu, and covariance with two large datasets in 200 code evaluations for effectively searching the parameter spaces with up to 170,368 different configurations. We find that the Floyd-Warshall benchmark did not benefit from autotuning because Polly uses heuristics to optimize the benchmark to make it run much slower. To cope with this issue, we provide some compiler option solutions to improve the performance. Then we present loop autotuning without a user's knowledge using a simple mctree autotuning framework to further improve the performance of the Floyd-Warshall benchmark. We also extend the ytopt autotuning framework to tune a deep learning application.

show abstract

On the Representation of Partially Specified Implementations and its Application to the Optimization of Linear Algebra Kernels on GPU

Cited by 3 publications

References 26 publications

Mapping parallelism in a functional IR through constraint satisfaction: a case study on convolution for mobile GPUs

Mapping parallelism in a functional IR through constraint satisfaction: a case study on convolution for mobile GPUs

Autotuning PolyBench benchmarks with LLVM Clang/Polly loop optimization pragmas using Bayesian optimization

Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization Pragmas Using Bayesian Optimization (extended version)

Contact Info

Product

Resources

About