2020
DOI: 10.48550/arxiv.2006.06762
Preprint

Ansor: Generating High-Performance Tensor Programs for Deep Learning

Abstract: High-performance tensor programs are crucial to guarantee efficient execution of deep learning models. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously difficult. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering efforts in developing platform-specific optimization code or fall short in finding high-perf…

Cited by 9 publications (10 citation statements)
References 49 publications (48 reference statements)
“…TVM [12, 56] is a popular framework that targets the optimization of mainly compute-intensive operators. Users can provide schedules, and TVM tunes the parameters automatically.…”
Section: Discussion About TVM (mentioning)
confidence: 99%
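The workflow described above (a user-written schedule whose parameters TVM then tunes) can be sketched with TVM's tensor-expression (te) API. The matmul workload, shapes, and the `tile` knob below are illustrative assumptions, not taken from the cited work.

```python
# Minimal sketch of the "user provides a schedule, TVM tunes the parameters"
# workflow, using TVM's classic tensor-expression API. Sizes are illustrative.
import tvm
from tvm import te

N = 1024  # illustrative matrix size
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

def make_schedule(tile=32):
    # `tile` is the kind of knob a tuner would search over automatically.
    s = te.create_schedule(C.op)
    i, j = s[C].op.axis
    io, ii = s[C].split(i, factor=tile)
    jo, ji = s[C].split(j, factor=tile)
    s[C].reorder(io, jo, ii, ji)
    return s

func = tvm.build(make_schedule(), [A, B, C], target="llvm")
```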
“…There are recent advances in code generation for compute-intensive DNN layers. TVM [12], Ansor [56], and Halide [34] are capable of generating high-performance kernels with well-designed schedules. Ansor [56] also explores kernel fusion with a tuning approach, though with limited patterns supported.…”
Section: Related Work (mentioning)
confidence: 99%
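Since the quote refers to Ansor's search-based kernel generation, a minimal sketch of TVM's auto_scheduler module (the upstreamed form of Ansor) is given below; the matmul workload, sizes, and tuning budget are illustrative assumptions rather than the cited experiments.

```python
# Hedged sketch of Ansor-style auto-scheduling via TVM's auto_scheduler module.
# The matmul workload and tuning budget are illustrative, not from the paper.
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

target = tvm.target.Target("llvm")
task = auto_scheduler.SearchTask(
    func=matmul, args=(1024, 1024, 1024, "float32"), target=target
)

log_file = "matmul.json"
task.tune(auto_scheduler.TuningOptions(
    num_measure_trials=64,  # small search budget for illustration
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
))
sch, args = task.apply_best(log_file)  # best schedule found by the search
func = tvm.build(sch, args, target)
```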
“…TVM allows users to implement schedules for each new operator by hand. Due to the large number of operators involved, for each device we use TVM's default schedules; automatic schedule design [84] has yet to be incorporated into TVM. We then enable auto-tuning of parameter values within the schedule to find the best performance.…”
Section: Implementation and Setup (mentioning)
confidence: 99%
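The "auto-tuning of parameter values within the schedule" mentioned above corresponds to TVM's AutoTVM flow. The sketch below assumes a hypothetical template name, knob names, sizes, and trial budget; it illustrates the mechanism, not the cited setup.

```python
# Hedged sketch of AutoTVM-style parameter tuning inside a schedule template.
# Template name, knobs, sizes, and trial count are illustrative assumptions.
import tvm
from tvm import te, autotvm

@autotvm.template("example/matmul")  # hypothetical template name
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    y, x = s[C].op.axis
    cfg = autotvm.get_config()
    cfg.define_split("tile_y", y, num_outputs=2)  # tunable knobs
    cfg.define_split("tile_x", x, num_outputs=2)
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, yi, xi)
    return s, [A, B, C]

task = autotvm.task.create("example/matmul", args=(512, 512, 512, "float32"), target="llvm")
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(), runner=autotvm.LocalRunner(number=5)
)
autotvm.tuner.XGBTuner(task).tune(
    n_trial=32,  # small tuning budget for illustration
    measure_option=measure_option,
    callbacks=[autotvm.callback.log_to_file("matmul.log")],
)
```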
“…However, since it generates assembly code, it is not able to run on different architectures such as ARM. Regarding deep learning applications, a plethora of compiler-based approaches has arisen [9, 10, 61, 77]. AutoTVM [10] generates the best implementation for a specific DNN by extracting domain-specific features from a given low-level abstract syntax tree.…”
Section: Related Work (mentioning)
confidence: 99%