Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages 2018
DOI: 10.1145/3211346.3211354

Diesel: DSL for linear algebra and neural net computations on GPUs

Cited by 43 publications (27 citation statements) · References 3 publications
“…Matrix multiplication is the most tuned computation kernel in history: the missing optimizations are all well known and may be found in use cases and open-source implementations such as CUTLASS [36]. Alternatively, polyhedral compilation has been shown to match or outperform cuBLAS, provided sufficient target- and operator-specific information has been captured in the optimization heuristic and code generator [20]. While our scientific focus was on covering a wide range of layers with TC, a production release would need to embed such operator-specific strategies as well.…”
Section: Performance Results
Mentioning confidence: 99%
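For context, the statement above concerns the hand-tuned GPU matrix multiplication kernels found in CUTLASS and cuBLAS, which Diesel-style polyhedral code generators aim to match. The sketch below is a minimal, illustrative shared-memory tiled GEMM in CUDA; the kernel name, tile size, and the assumption that matrix dimensions are multiples of the tile are ours, and real library kernels layer many further optimizations (register tiling, double buffering, tensor cores) on top of this structure.

```cuda
// Minimal sketch of a shared-memory tiled GEMM (C = A * B) on the GPU.
// Names and the TILE size are illustrative; M, N, K are assumed to be
// multiples of TILE. This is not Diesel's, CUTLASS's, or cuBLAS's code.
#define TILE 32

__global__ void gemm_tiled(const float *A, const float *B, float *C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C computed by this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C computed by this thread
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Cooperatively stage one TILE x TILE block of A and of B in shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Each thread accumulates one element of the output tile.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Launching with dim3 blocks(N/TILE, M/TILE) and dim3 threads(TILE, TILE) maps each thread block to one output tile; the tuned kernels referenced above differ mainly in how aggressively they exploit registers and instruction-level parallelism around this same structure.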
“…The polyhedral framework of compilation emerged as a natural candidate to design a versatile optimization flow satisfying the needs of the domain and target hardware. It has demonstrated strong results in domain-specific optimization [5,9,20,46], expert-driven meta-programming [6,15,26], embedding of third-party library code [40], and automatic generation of efficient code for heterogeneous targets [5,7,43,51,70,77]. We attempt to take the best of both worlds, defining a domain-specific language rich enough to capture full sub-graphs of modern Machine Learning (ML) models while enabling aggressive compilation competitive to native libraries.…”
Section: Introduction
Mentioning confidence: 99%
“…(Vasilache et al. 2018). Originally, Halide was designed for image processing pipelines on NVIDIA GPUs, similar to Diesel (Elango et al. 2018), while Tensor Comprehensions is designed for machine learning applications. DSLs abstract away complex control logic compared to traditional languages like C/C++/Java.…”
Section: Discussion
Mentioning confidence: 99%
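To make the abstraction claim concrete: a tensor DSL in the spirit of Diesel or Tensor Comprehensions expresses a matrix product as a single index expression, roughly C(i, j) += A(i, k) * B(k, j), leaving loop order, tiling, and GPU mapping to the compiler. The equivalent computation written directly in C/C++ spells out all of that control logic by hand; the sketch below (function name and row-major layout are our assumptions, not taken from any of the cited systems) shows the loop nest a programmer would otherwise write.

```cuda
// Reference matrix multiply written out explicitly in C/C++ (row-major layout).
// A tensor DSL collapses this whole loop nest into one comprehension,
// roughly "C(i, j) += A(i, k) * B(k, j)", and derives the schedule itself.
void matmul_reference(const float *A, const float *B, float *C,
                      int M, int N, int K) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];  // explicit indexing and loop order
            C[i * N + j] = acc;
        }
    }
}
```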
“…Diesel [9], NOVA [8], and PPCG [31] make heavy use of the polyhedral model for optimization. Fireiron generates nested affine loops and might therefore profit from using polyhedral techniques too.…”
Section: Related Work
Mentioning confidence: 99%