Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation 2021
DOI: 10.1145/3453483.3454083

DNNFusion: accelerating deep neural networks execution with advanced operator fusion

Abstract: Deep Neural Networks (DNNs) have emerged as the core enabler of many major applications on mobile devices. To achieve high accuracy, DNN models have become increasingly deep with hundreds or even thousands of operator layers, leading to high memory and computational requirements for inference. Operator fusion (or kernel/layer fusion) is a key optimization in many state-of-the-art DNN execution frameworks, such as TensorFlow, TVM, and MNN, that aims to improve the efficiency of DNN inference. However, these fr…
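
To make the role of operator fusion concrete, the following sketch contrasts an unfused chain of element-wise operators (each writing a full intermediate tensor) with a fused form that traverses the data once. It is a minimal NumPy illustration under assumed shapes and operators, not DNNFusion's actual code generation.

```python
# Minimal sketch (not the paper's implementation) of why operator fusion helps:
# the unfused chain materializes every intermediate tensor, while the fused
# expression computes the same result in a single conceptual pass.
import numpy as np

def unfused(x, scale, bias):
    t1 = x * scale              # operator 1 writes a full intermediate tensor
    t2 = t1 + bias              # operator 2 writes another intermediate
    return np.maximum(t2, 0.0)  # operator 3 (ReLU) makes a third pass

def fused(x, scale, bias):
    # Conceptually one kernel: a fusing compiler emits a single loop so the
    # intermediates stay in registers/cache instead of going through memory.
    return np.maximum(x * scale + bias, 0.0)

x = np.random.rand(1 << 20).astype(np.float32)
assert np.allclose(unfused(x, 2.0, -1.0), fused(x, 2.0, -1.0))
```

In a real framework the fused form is emitted as one kernel, eliminating the intermediate memory traffic that dominates the unfused version.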

Cited by 75 publications (20 citation statements)
References 74 publications

“…Existing work [23,24] has identified such a problem in a more general scope and introduces a compiler-level optimization technique to exploit inter-operation and intra-operation parallelism. These solutions could further improve the parallel FFN computation efficiency in MoE systems.…”
Section: Discussion
confidence: 99%
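
As a rough illustration of the inter-operator parallelism this statement refers to, the sketch below dispatches independent expert FFNs concurrently instead of one after another. The thread-pool scheduling, shapes, and expert count are assumptions made for illustration, not the cited compilers' technique.

```python
# Hypothetical sketch: independent operators (here, per-expert FFNs) have no
# data dependency, so they can be run side by side (inter-operator parallelism),
# while each matmul can still be split across cores (intra-operator parallelism).
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def ffn(x, w1, w2):
    # One feed-forward "operator"; intra-operator parallelism would come from
    # the BLAS backend splitting these matmuls across cores.
    return np.maximum(x @ w1, 0.0) @ w2

x = np.random.rand(64, 256).astype(np.float32)
experts = [(np.random.rand(256, 1024).astype(np.float32),
            np.random.rand(1024, 256).astype(np.float32))
           for _ in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Inter-operator parallelism: the four experts are dispatched concurrently.
    outputs = list(pool.map(lambda w: ffn(x, *w), experts))

print([o.shape for o in outputs])  # four (64, 256) results
```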
“…(2) Graph-level: DNNs with many operators are commonly represented as directed acyclic graphs (DAGs), which use nodes to represent operators and edges to represent the data flow and dependency [17]. Single-model DAGs are usually sequential with limited parallelism, like VGG, ResNets, MobileNets and EfficientNets, which have only one or two branches and thus expose a small scheduling space [23].…”
Section: A. Challenges for Multi-tenant DL Computing
confidence: 99%
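
A small, self-contained sketch of the DAG representation described above: operators as nodes, data dependencies as edges, with a topological traversal whose ready-set size approximates how much scheduling space a mostly sequential model exposes. The tiny graph and the "width" measure are illustrative assumptions, not the cited papers' exact model.

```python
# Illustrative DAG of a mostly sequential (ResNet-like) block: only the
# residual edge adds a second branch, so the ready set rarely exceeds one.
from collections import defaultdict, deque

edges = [("input", "conv1"), ("conv1", "conv2"), ("conv2", "add"),
         ("input", "add"), ("add", "relu")]

succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
for u, v in edges:
    succ[u].append(v)
    indeg[v] += 1
    nodes.update((u, v))

# Kahn's algorithm: the size of the ready set at each step approximates how
# many operators could be scheduled in parallel (the scheduling space).
ready = deque(n for n in nodes if indeg[n] == 0)
order, max_width = [], 0
while ready:
    max_width = max(max_width, len(ready))
    n = ready.popleft()
    order.append(n)
    for m in succ[n]:
        indeg[m] -= 1
        if indeg[m] == 0:
            ready.append(m)

print(order, "max parallel width:", max_width)  # width stays at 1 here
```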
“…AStitch is orthogonal to the above studies in that it focuses on generating high-performance GPU kernels just-in-time, given a large group of memory-intensive operators. Niu et al. [34] study fusion optimization for inference on mobile devices, while AStitch targets both training and inference on industrial GPU workloads, showing different targets and techniques. Zheng et al. [57] explore operator stitching with shared memory, and use a two-level cost-model-based method for fusion pattern decision and codegen schedule selection.…”
Section: Related Work
confidence: 99%
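
To illustrate what a fusion-pattern decision can look like, the sketch below greedily groups adjacent memory-intensive (element-wise) operators into a single fused kernel and keeps compute-intensive operators as group boundaries. The one-line classification rule is a hypothetical stand-in for the cost models used by the systems cited above.

```python
# Hypothetical greedy fusion-pattern decision over a flat operator chain:
# consecutive memory-intensive ops are stitched into one group (one kernel),
# compute-intensive ops stay as standalone groups.
MEMORY_INTENSIVE = {"add", "mul", "relu", "cast", "broadcast"}

def fuse_chain(ops):
    groups, current = [], []
    for op in ops:
        if op in MEMORY_INTENSIVE:
            current.append(op)          # keep stitching into the open group
        else:
            if current:
                groups.append(current)  # close the memory-intensive group
                current = []
            groups.append([op])         # compute-intensive op stays alone
    if current:
        groups.append(current)
    return groups

print(fuse_chain(["matmul", "add", "relu", "matmul", "mul", "cast"]))
# [['matmul'], ['add', 'relu'], ['matmul'], ['mul', 'cast']]
```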