Communication Lower Bounds of Convolutions in CNNs

Zhang, Xiaoyang; Xiao, Junmin; Tan, Guangming

doi:10.1145/3350755.3400267

Cited by 3 publications

(1 citation statement)

References 24 publications


“…As the recent methodology mainly focuses on the workflow's specific properties which do not translate across different computational patterns, the recent lower bound theory seems hard to be applied to arbitrary computations such as convolutions, in which different sub-computations involve different computational patterns. How to establish a systematic I/O lower bound theory for convolutions based on the red-blue pebble game model is a big challenge [32]. Even if the lower bounds could be obtained, the theoretical minimum of I/O complexity is not easy to directly yield an efficient dataflow strategy.…”

Section: Introduction (mentioning)

confidence: 99%

I/O lower bounds for auto-tuning of convolutions in CNNs

Zhang, Xiao, Tan (2021)

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Self Cite


Convolution is the most time-consuming part of the computation in convolutional neural networks (CNNs), which have achieved great success in numerous practical applications. Due to the complex data dependency and the increase in the amount of model samples, convolution suffers from high data movement (i.e., memory access) overhead. This work provides a comprehensive analysis and methodologies to minimize the communication for convolution in CNNs. With an in-depth analysis of the recent I/O complexity theory under the red-blue game model, we develop a general I/O lower bound theory for a composite algorithm that consists of several different sub-computations. Based on the proposed theory, we establish data movement lower bounds for the two main convolution algorithms in CNNs, namely the direct convolution and the Winograd algorithm, which represent the direct and indirect implementations of a convolution, respectively. Next, derived from the I/O lower bound results, we design near I/O-optimal dataflow strategies for the two main convolution algorithms by fully exploiting data reuse. Furthermore, to push the performance of the near I/O-optimal dataflow strategies further, an aggressive auto-tuning design based on I/O lower bounds is proposed to search for an optimal parameter configuration for the direct convolution and the Winograd algorithm on GPU, such as the number of threads and the size of shared memory used in each thread block. Finally, experimental evaluation results on the direct convolution and the Winograd algorithm show that our dataflow strategies with the auto-tuning approach achieve about 3.32× speedup on average over cuDNN. In addition, compared with TVM, which represents the state-of-the-art technique for auto-tuning, our auto-tuning method based on I/O lower bounds not only finds the optimal parameter configuration faster, but our solution also achieves higher performance than the optimal solution provided by TVM.


DeepZoning: Re-accelerate CNN Inference with Zoning Graph for Heterogeneous Edge Cluster

Wang, Ma, Yang et al. (2024)

ACM Trans. Archit. Code Optim.


Parallelizing CNN inference on heterogeneous edge clusters with data parallelism has gained popularity as a way to meet real-time requirements without sacrificing model accuracy. However, existing algorithms struggle to find the optimal parallel granularity for complex CNNs, whose structure is a directed acyclic graph (DAG) rather than a chain, and their parallel dimension is inflexible. Distributing the workload of modern CNNs across heterogeneous devices has also been proven to be an NP-hard problem. In this paper, we introduce DeepZoning, a versatile and cooperative inference framework that combines model and data parallelism to accelerate CNN inference. DeepZoning employs two algorithms at different levels: (1) a low-level Adaptive Workload Partition algorithm that uses linear programming and takes the spatial and channel dimensions into account when searching for a feature map distribution on heterogeneous devices, and (2) a high-level Model Partition algorithm that finds the optimal model granularity and organizes complex CNNs into sequential zones to balance communication and computation during execution. Our experimental evaluations show that DeepZoning is effective, achieving up to a 3.02× speedup on our experimental prototype compared to state-of-the-art algorithms.
