Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2021
DOI: 10.1145/3437801.3441609

I/O lower bounds for auto-tuning of convolutions in CNNs

Abstract: Convolution is the most time-consuming part of the computation in convolutional neural networks (CNNs), which have achieved great success in numerous practical applications. Due to the complex data dependencies and the growing amount of model samples, convolution suffers from high data-movement (i.e., memory-access) overhead. This work provides a comprehensive analysis and methodologies to minimize the communication for the convolution in CNNs. With an in-depth analysis of the recent I/O complexit…
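To ground the data-movement concern, here is a minimal sketch of direct convolution (illustrative only; the layout, shapes, and function name are assumptions, not the paper's implementation). Every output element re-reads an input patch and a filter, so without tiling the same data crosses the memory hierarchy many times:

```python
# Minimal direct convolution sketch, assuming CHW layout, unit stride, no padding.
# Illustrative only -- not the paper's implementation or its optimal dataflow.
import numpy as np

def direct_conv(x, w):
    """x: (C_in, H, W) input; w: (C_out, C_in, K, K) filters."""
    c_in, h, wd = x.shape
    c_out, _, k, _ = w.shape
    assert w.shape[1] == c_in
    out = np.zeros((c_out, h - k + 1, wd - k + 1))
    for co in range(c_out):
        for i in range(h - k + 1):
            for j in range(wd - k + 1):
                # Each output element touches a (C_in, K, K) input patch plus
                # one filter; untiled, these patches are re-fetched from slow
                # memory for every (co, i, j), which drives the I/O cost.
                out[co, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[co])
    return out

x = np.random.rand(3, 8, 8)     # C_in=3, H=W=8
w = np.random.rand(4, 3, 3, 3)  # C_out=4, K=3
print(direct_conv(x, w).shape)  # (4, 6, 6)
```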

Cited by 6 publications (7 citation statements), published 2021–2024
References 28 publications
“…Neural Networks. Analyzing I/O lower bounds of neural networks is a nascent field, and so far only single-layer convolution has been analyzed [20, 23]. We improve the bound previously reported by Zhang et al. [20] by a factor of 8.…”
Section: Discussion (mentioning)
confidence: 60%
“…The first asymptotic I/O lower bound for single-layer direct convolution was proved by Demmel et al. [23]. Chen et al. [55] propose a matching implementation, and Zhang et al. [20] present the first non-asymptotic I/O lower bound for Winograd convolution.…”
Section: Related Work (mentioning)
confidence: 99%
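For context, Winograd convolution, whose non-asymptotic I/O lower bound [20] is discussed above, reduces multiplications by transforming tiles. Below is a textbook 1-D F(2,3) instance, a sketch in the standard Lavin–Gray formulation rather than any dataflow from the cited papers: it computes two outputs of a 3-tap filter with four multiplications.

```python
def winograd_f23(d, g):
    """1-D Winograd F(2,3): d is 4 inputs, g is 3 filter taps -> 2 outputs,
    using 4 multiplications instead of the 6 a direct computation needs."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

# Check against the direct definition y[i] = sum_k d[i+k] * g[k].
d, g = [1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.125]
ref = [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
assert all(abs(a - b) < 1e-12 for a, b in zip(winograd_f23(d, g), ref))
```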
“…This novel technique is called deep reuse. Another recent and sophisticated approach [28] divides the Winograd algorithm into subcomputations and establishes data-movement lower bounds to derive the optimal I/O dataflow that maximizes data reuse. Furthermore, the authors propose an auto-tuning technique to dynamically find the optimal parameter configuration.…”
Section: Related Work (mentioning)
confidence: 99%
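As a rough illustration of such auto-tuning, the toy sketch below enumerates tile sizes that fit a fast-memory budget and keeps the one minimizing a naive traffic estimate; the cost model, parameters, and function name are assumptions for illustration, not the method of [28].

```python
# Toy auto-tuner: pick output-tile sizes for a K x K convolution that fit
# a fast-memory budget and minimize a simple memory-traffic estimate.
# The cost model and all parameters are illustrative assumptions, not from [28].
import itertools

def tune_tiles(H, W, K, fast_mem_words):
    """Return (traffic, tile_h, tile_w) minimizing the traffic estimate."""
    best = None
    for th, tw in itertools.product(range(1, H + 1), range(1, W + 1)):
        # Per-tile footprint: input region including the (K-1) halo + output tile.
        footprint = (th + K - 1) * (tw + K - 1) + th * tw
        if footprint > fast_mem_words:
            continue  # tile would not fit in fast memory
        n_tiles = -(-H // th) * (-(-W // tw))  # ceiling divisions
        traffic = n_tiles * footprint          # words moved, one pass per tile
        if best is None or traffic < best[0]:
            best = (traffic, th, tw)
    return best

traffic, th, tw = tune_tiles(H=56, W=56, K=3, fast_mem_words=4096)
print(f"best tile {th}x{tw}, estimated traffic {traffic} words")
```

A real tuner would also search over channel blocking and loop orders and validate candidates by measurement; this sketch only conveys the search-under-a-capacity-constraint idea.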