Proceedings of the 13th Annual Workshop on General Purpose Processing Using Graphics Processing Unit 2020
DOI: 10.1145/3366428.3380771

Automatic generation of specialized direct convolutions for mobile GPUs

Abstract: Convolutional Neural Networks (CNNs) are a powerful and versatile tool for performing computer vision tasks in both resource-constrained settings and server-side applications. Most GPU hardware vendors provide highly tuned libraries for CNNs, such as Nvidia's cuDNN or the ARM Compute Library. Such libraries are the basis for higher-level, commonly used machine-learning frameworks such as PyTorch or Caffe, abstracting them away from vendor-specific implementation details. However, writing optimized parallel code fo…
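
As context for the abstract, here is a minimal scalar sketch of a direct convolution, the operation the paper generates specialized kernels for. The paper targets mobile GPUs via code generation, so this C loop nest is only an illustration of the direct method (each output element computed as an explicit sum over the filter window, with no im2col or matrix-multiply lowering); the NCHW layout, parameter names, and zero-padding convention are assumptions for illustration, not the authors' generated code.

/* Illustrative sketch of a direct 2D convolution (assumed NCHW
 * layout, zero padding); not the paper's generated OpenCL code. */
void direct_conv2d(const float *in,  /* [C][H][W]         */
                   const float *w,   /* [K][C][R][S]      */
                   float *out,       /* [K][H_out][W_out] */
                   int C, int H, int W,
                   int K, int R, int S,
                   int stride, int pad)
{
    int H_out = (H + 2 * pad - R) / stride + 1;
    int W_out = (W + 2 * pad - S) / stride + 1;

    for (int k = 0; k < K; ++k)                 /* output channel */
        for (int y = 0; y < H_out; ++y)         /* output row     */
            for (int x = 0; x < W_out; ++x) {   /* output column  */
                float acc = 0.0f;
                for (int c = 0; c < C; ++c)     /* input channel  */
                    for (int r = 0; r < R; ++r) /* filter row     */
                        for (int s = 0; s < S; ++s) {
                            int iy = y * stride - pad + r;
                            int ix = x * stride - pad + s;
                            /* Skip out-of-bounds taps: zero padding. */
                            if (iy >= 0 && iy < H && ix >= 0 && ix < W)
                                acc += in[(c * H + iy) * W + ix]
                                     * w[((k * C + c) * R + r) * S + s];
                        }
                out[(k * H_out + y) * W_out + x] = acc;
            }
}

Specializing such a kernel means fixing C, K, R, S, stride, and pad at code-generation time so the compiler can fully unroll, vectorize, and map the loops to the target GPU.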

Cited by 12 publications (8 citation statements) · References 12 publications

Citation statements, ordered by relevance:
“…Figure 1. The entire optimization flow in Lift: tuning constraint inference; tiling, parallelization, and padding tuning (GPGPU'20 [14]); code generation (CGO'17 [21]).…”
Section: Tuning (mentioning)
confidence: 99%
“…The focus is on the convolution, the most compute-intensive operation [10] of a CNN architecture. Prior work [14] has shown how this kernel can be expressed and optimized in Lift. In contrast to prior work, the mapping of parallelism is performed automatically using constraints.…”
Section: Tuning (mentioning)
confidence: 99%
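
To make "most compute-intensive" concrete, here is a back-of-the-envelope C sketch of the multiply-accumulate (MAC) count of a single convolution layer compared with its bias add; the VGG-style shapes are illustrative assumptions, not figures from the cited papers.

/* Rough cost model: MACs of one conv layer vs. its bias add.
 * Shapes (256->256 channels, 56x56 output, 3x3 filter) are assumed. */
#include <stdio.h>

int main(void)
{
    long long C = 256, K = 256;   /* input / output channels */
    long long H = 56, W = 56;     /* output spatial size     */
    long long R = 3, S = 3;       /* filter size             */

    long long macs = K * H * W * C * R * S;   /* ~1.85e9 MACs */
    printf("conv MACs: %lld (~%.1f GFLOPs at 2 ops/MAC)\n",
           macs, 2.0 * macs / 1e9);

    long long bias_ops = K * H * W;           /* one add per output */
    printf("bias adds: %lld (%.4f%% of conv work)\n",
           bias_ops, 100.0 * bias_ops / (2.0 * macs));
    return 0;
}

With these shapes the convolution performs roughly four orders of magnitude more arithmetic than the surrounding elementwise work, which is why optimization effort concentrates on this kernel.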
“…In the future, we can combine DNNFusion's high-level abstraction with existing domain-specific polyhedral analysis. Similarly, another promising direction would be to integrate DNNFusion into other compilation-based DNN frameworks [25,45] or other popular general tensor/matrix/linear-algebra computation frameworks, such as MLIR [40], Tiramisu [4], TACO [33,34], Halide [56], and LGen [38,64]. There also exist several other frameworks that optimize machine learning with operator fusion or fusion-based ideas.…”
Section: Related Work (mentioning)
confidence: 99%
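
As a minimal illustration of the operator fusion this excerpt refers to, here is a sketch of the simplest case: fusing two elementwise operators so the intermediate tensor is never materialized. Names and shapes are illustrative assumptions, not DNNFusion's API.

/* Unfused: bias-add writes a temporary that ReLU immediately rereads,
 * costing an extra round trip through memory. */
#include <stddef.h>

static inline float relu(float v) { return v > 0.0f ? v : 0.0f; }

void bias_relu_unfused(const float *x, const float *b,
                       float *tmp, float *y, size_t n, size_t c)
{
    for (size_t i = 0; i < n; ++i) tmp[i] = x[i] + b[i % c];
    for (size_t i = 0; i < n; ++i) y[i] = relu(tmp[i]);
}

/* Fused: one loop, no temporary buffer; this is the kind of rewrite
 * fusion-based DNN compilers apply automatically across the graph. */
void bias_relu_fused(const float *x, const float *b,
                     float *y, size_t n, size_t c)
{
    for (size_t i = 0; i < n; ++i) y[i] = relu(x[i] + b[i % c]);
}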
“…Compiler optimization. There has been much interest in autotuning DNN code generators [10,23,37,48,64,72]. Polyhedral compilers are particularly well suited [72,83] as they have built-in abstractions for exploiting parallelism and memory layout in a principled way.…”
Section: Interpolating Between Models (mentioning)
confidence: 99%
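
As a small illustration of the loop restructuring such compilers automate, here is a sketch of loop tiling applied to a matrix transpose, a classic case where tiling keeps the working set of both arrays cache-resident; the tile size and names are illustrative assumptions, not output of any polyhedral tool.

/* Tiled transpose: the untiled version strides through memory on
 * either reads or writes; processing TILE x TILE blocks keeps both
 * the source and destination blocks in cache at once. */
#include <stddef.h>

#define TILE 32  /* assumed tile size; autotuners search this */

void transpose_tiled(const float *a, float *b, size_t n, size_t m)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < m; jj += TILE)
            /* Intra-tile loops; bounds clamp the ragged edges. */
            for (size_t i = ii; i < ii + TILE && i < n; ++i)
                for (size_t j = jj; j < jj + TILE && j < m; ++j)
                    b[j * n + i] = a[i * m + j];
}

Polyhedral frameworks derive such tilings (and legal parallelization of the tile loops) from the loop nest's iteration domain rather than by hand, which is what makes them attractive for DNN code generation.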