Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2017
DOI: 10.1145/3126908.3126939
Input-aware auto-tuning of compute-bound HPC kernels

Abstract: Efficient implementations of HPC applications for parallel architectures generally rely on external software packages (e.g., BLAS, LAPACK, CUDNN). While these libraries provide highly optimized routines for certain characteristics of inputs (e.g., square matrices), they generally do not retain optimal performance across the wide range of problems encountered in practice. In this paper, we present an input-aware auto-tuning framework for matrix multiplications and convolutions, ISAAC, which uses predictive mode…

Cited by 28 publications (23 citation statements)
References 18 publications
“…Table III summarizes the results of co-running Conv2DBackpropFilter and Conv2DBackpropInput. The input size for the operations is par_input (32,8,8,2048). Given this input size, the number of threads to achieve the best performance for the two operations is 68.…”

Section: Motivation Examples
confidence: 99%
“…To address the second challenge, we can replace some time-consuming operations (e.g. convolutions) in cuDNN with operations from an open-source library with better or comparable performance (e.g., ISAAC [32]) such that we can control intra-op parallelism at runtime.…”

Section: B. Future Work
confidence: 99%
“…Our approach is closely related to autotuning that searches for the best-performing optimization configuration [58], [59]. This technique is demonstrated to be effective for choosing algorithmic choices [60], tuning GPU code [61], [62], [63], optimizing structured parallel programs [64], [65], [66] and non-uniform memory access (NUMA) architectures [67], and more recently for deep neural networks [68]. Many of the prior works in this area employ an evolutionary-based approach by applying and profiling candidate optimization options to choose a good option to use.…”
Section: Domain-specific Optimizations
confidence: 99%
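The profile-and-select loop the quoted survey describes can be sketched as follows (a toy illustration, not any cited system's implementation; `autotune_threads` and the cost model are invented for this example): generate candidate configurations, "profile" each with a cost function, keep the best, and mutate them, as an evolutionary tuner would.

```python
import random

def autotune_threads(cost, candidates=range(1, 129), generations=3, seed=0):
    """Evolutionary-style search over thread counts.

    `cost` stands in for profiling a candidate on hardware; lower is better.
    """
    rng = random.Random(seed)
    pool = rng.sample(list(candidates), 8)   # initial random population
    for _ in range(generations):
        pool.sort(key=cost)                  # "profile" and rank candidates
        survivors = pool[:4]                 # keep the best half
        children = [max(1, t + rng.choice((-2, -1, 1, 2)))
                    for t in survivors]      # mutate the survivors
        pool = survivors + children
    return min(pool, key=cost)

# Toy cost model with a sweet spot at 68 threads, echoing the quoted example
# in which 68 threads gave the best performance for both convolution ops.
cost = lambda t: abs(t - 68)
```

The seed makes the search reproducible; a real tuner would replace `cost` with wall-clock measurements of the kernel under each candidate configuration.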
“…For the sake of generality, we take both implementations into account. Then, we implement and run Apollo's object detection module using NVIDIA's CUTLASS [9], an open-source collection of CUDA C++ template abstractions for implementing high-performance matrix-multiplication, and ISAAC [13], an input-aware auto-tuning framework and code-generator for compute-bound HPC kernels.…”
Section: Other Challenges
confidence: 99%