Communication-Optimal Convolutional Neural Nets

Demmel, James; Dinh, Grace

doi:10.48550/arxiv.1802.06905

Cited by 11 publications

(20 citation statements)

References 6 publications

“…Lemma 4.2. For $\tau \in \mathbb{Z}^{+}$ and the nested bilinear algorithm $F = (A^{\otimes\tau}, B^{\otimes\tau}, C^{\otimes\tau})$, where $X^{\otimes\tau} = \bigotimes_{i=1}^{\tau} X$ and $(A, B, C)$ is the bilinear algorithm for Strassen's base algorithm (Definition 7.1), (3) is an expansion bound for $F$.…”

Section: Fast Matrix Multiplication (mentioning)

confidence: 99%
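As an illustration of the nesting in the lemma quoted above, here is a minimal Python sketch. Definition 7.1 of the citing paper is not reproduced on this page, so the encoding matrices below are an assumption: they are the textbook Strassen products written in the convention c = C((Aᵀa) ⊙ (Bᵀb)), with `A_T` and `B_T` holding the transposes. The helper names (`kron_power`, `block_vec`, `block_unvec`, `nested_strassen`) and the block-recursive vectorization order are likewise hypothetical choices made for this example.

```python
import numpy as np
from functools import reduce

# Assumed encoding of Strassen's 2x2 base algorithm, row-major vec:
# a = [a11, a12, a21, a22]. Row r of A_T / B_T gives the two linear
# combinations multiplied in product M_{r+1}; C recombines M1..M7.
A_T = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
                [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]])
B_T = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
                [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
C = np.array([[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]])

def kron_power(M, tau):
    """Kronecker product of M with itself tau times (tau >= 1)."""
    return reduce(np.kron, [M] * tau)

def block_vec(X, tau):
    """Vectorize a 2^tau x 2^tau matrix block-recursively, so that the
    index order (i1, j1, i2, j2, ...) matches the Kronecker structure."""
    perm = [ax for t in range(tau) for ax in (t, tau + t)]
    return X.reshape([2] * (2 * tau)).transpose(perm).reshape(-1)

def block_unvec(c, tau):
    """Inverse of block_vec: recover the 2^tau x 2^tau matrix."""
    perm = list(range(0, 2 * tau, 2)) + list(range(1, 2 * tau, 2))
    n = 2 ** tau
    return c.reshape([2] * (2 * tau)).transpose(perm).reshape(n, n)

def nested_strassen(X, Y, tau):
    """Multiply X and Y with the tau-fold nested bilinear algorithm."""
    A_t, B_t, C_t = (kron_power(M, tau) for M in (A_T, B_T, C))
    products = (A_t @ block_vec(X, tau)) * (B_t @ block_vec(Y, tau))
    return block_unvec(C_t @ products, tau)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(-5, 5, size=(4, 4))
    Y = rng.integers(-5, 5, size=(4, 4))
    print(np.array_equal(nested_strassen(X, Y, 2), X @ Y))
```

Forming the Kronecker powers explicitly is only practical for small τ; the sketch is meant to show the structure of the nested encoding, not an efficient implementation.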

“…Repeatedly applying Corollary 3.7, we find $\sigma(k)$ remains a rank expansion lower bound for $A^{\otimes\tau}$ as well as $B^{\otimes\tau}$ and $C^{\otimes\tau}$. Then for $k \in [7^{\tau}]$ and P ∈ P 3) to both sides leads to…”

Section: Fast Matrix Multiplication (mentioning)

confidence: 99%

“…The bilinear rank of Strassen's algorithm is $R = n^{\log_2 7}$. Using the expansion bound for Strassen's algorithm from Lemma 4.2, we have 3) .…”

Section: Fast Matrix Multiplication (mentioning)

confidence: 99%
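The rank quoted above can be recovered from the recursion depth. As a short worked derivation, assuming $n = 2^{\tau}$ for the matrix dimension (an assumption not stated in the snippet) and using the fact that each level of Strassen's recursion multiplies the number of scalar products by 7:

```latex
\[
  R \;=\; 7^{\tau} \;=\; 7^{\log_2 n} \;=\; \bigl(2^{\log_2 7}\bigr)^{\log_2 n}
    \;=\; n^{\log_2 7} \;\approx\; n^{2.807}.
\]
```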

“…Hong and Kung initiated the study of communication lower bounds by modeling the computation as a directed acyclic dependency graph (dependency DAG), and representing the data access patterns via a red-blue pebble game [2]. Since then, new techniques have been developed to derive more lower bounds, such as volumetric inequalities for nested loop programs [3,4,5,6] and analysis of expansion and separability of the dependency DAG [7,8,9]. These approaches derive closed-form expressions for lower bounds by considering a particular dependency DAG consisting of binary operations on scalar values.…”

Section: Introduction (mentioning)

confidence: 99%

Communication lower bounds for nested bilinear algorithms

Zhang, Solomonik. 2021. Preprint.

We develop lower bounds on communication in the memory hierarchy or between processors for nested bilinear algorithms, such as Strassen's algorithm for matrix multiplication. We build on a previous framework that establishes communication lower bounds by use of the rank expansion, or the minimum rank of any fixed size subset of columns of a matrix, for each of the three matrices encoding the bilinear algorithm. This framework provides lower bounds for any way of computing a bilinear algorithm, which encompasses a larger space of algorithms than by fixing a particular dependency graph. Nested bilinear algorithms include fast recursive algorithms for convolution, matrix multiplication, and contraction of tensors with symmetry. Two bilinear algorithms can be nested by taking Kronecker products between their encoding matrices. Our main result is a lower bound on the rank expansion of a matrix constructed by a Kronecker product derived from lower bounds on the rank expansion of the Kronecker product's operands. To prove this bound, we map a subset of columns from a submatrix to a 2D grid, collapse them into a dense grid, expand the grid, and use the size of the expanded grid to bound the number of linearly independent columns of the submatrix. We apply the rank expansion lower bounds to obtain novel communication lower bounds for nested Toom-Cook convolution, Strassen's algorithm, and fast algorithms for partially symmetric contractions.
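To make the abstract's definition of rank expansion concrete, the following Python sketch computes it by brute force, taking the abstract's wording literally: the minimum rank over all column subsets of a fixed size k. The function name `rank_expansion` and the small example matrix are hypothetical, and the exhaustive search is only feasible for tiny matrices; the paper derives bounds analytically instead.

```python
import itertools
import numpy as np

def rank_expansion(M, k):
    """Minimum rank over all size-k subsets of the columns of M.

    Brute force over every subset, so only usable for small matrices.
    """
    n_cols = M.shape[1]
    return min(
        int(np.linalg.matrix_rank(M[:, list(cols)]))
        for cols in itertools.combinations(range(n_cols), k)
    )

if __name__ == "__main__":
    # Hypothetical 2x3 example matrix, chosen only for illustration.
    M = np.array([[1, 0, 1],
                  [0, 1, 1]])
    # Nesting two bilinear algorithms takes Kronecker products of their
    # encoding matrices, so the nested matrix here is kron(M, M).
    K = np.kron(M, M)
    for k in range(1, M.shape[1] + 1):
        print("M:", k, rank_expansion(M, k))
    for k in range(1, K.shape[1] + 1):
        print("kron(M, M):", k, rank_expansion(K, k))
```

Comparing the two printed tables shows how the rank expansion of the Kronecker product relates to that of its operands, which is the relationship the paper's main result bounds from below.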

“…Since analyzing programs with parametric sizes disallows the construction of an explicit Computation Directed Acyclic Graph (CDAG), some form of parameterization is often needed [18][19][20]. However, we argue that the widely-used approaches based on the Loomis-Whitney or the HBL inequalities [21][22][23] (a) are often too restrictive, requiring the programs to be expressed in the polyhedral model to count the points in the projection polytopes; (b) do not capture pebbling motifs such as recomputation [19]; or (c) are limited to single-statement programs [7, 21-23, 23, 24].…”

Section: Introduction (mentioning)

confidence: 99%

Pebbles, Graphs, and a Pinch of Combinatorics: Towards Tight I/O Lower Bounds for Statically Analyzable Programs

Kwasniewski, Ben-Nun, Gianinazzi, et al. 2021. Preprint.

Determining I/O lower bounds is a crucial step in obtaining communication-efficient parallel algorithms, both across the memory hierarchy and between processors. Current approaches either study specific algorithms individually, disallow programmatic motifs such as recomputation, or produce asymptotic bounds that exclude important constants. We propose a novel approach for obtaining precise I/O lower bounds on a general class of programs, which we call Simple Overlap Access Programs (SOAP). SOAP analysis covers a wide variety of algorithms, from ubiquitous computational kernels to full scientific computing applications. Using the red-blue pebble game and combinatorial methods, we are able to bound the I/O of the SOAP-induced Computational Directed Acyclic Graph (CDAG), taking into account multiple statements, input/output reuse, and optimal tiling. To deal with programs that are outside of our representation (e.g., non-injective access functions), we describe methods to approximate them with SOAP. To demonstrate our method, we analyze 38 different applications, including kernels from the Polybench benchmark suite, deep learning operators, and, for the first time, applications in unstructured physics simulations, numerical weather prediction stencil compositions, and full deep neural networks. We derive tight I/O bounds for several linear algebra kernels, such as Cholesky decomposition, improving the existing reported bounds by a factor of two. For stencil applications, we improve the existing bounds by a factor of up to 14. We implement our method as an open-source tool, which can derive lower bounds directly from provided C code.

Improving Strong-Scaling of CNN Training by Exploiting Finer-Grained Parallelism

Dryden, Maruyama, Benson, et al. 2019. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

B. Performance modeling

We make use of analytic models for some of our performance estimates, particularly communication. We use a linear model [21] for communication, where α is the latency and β is the inverse bandwidth. Then the cost to send a message between two nodes is α + βn. We additionally assume that the network is full-duplex and that there is no interference.

Collective communication operations such as allreduce will be important for some operations; for these, we use the performance models of Thakur et al. [22]. For distributed matrix multiplication, we use the performance models developed for the Elemental library [23].

C. Notation

We now define some notation for distributed tensors that will be used throughout this paper. Our notation is heavily based on the tensor notation developed for the FLAME project [23]-[25].

A tensor is an $M$-dimensional array, where the size of dimension $m$ is $I_m$, and we write $I = (I_0, \ldots, I_{M-1})$ to refer to the shape of an entire tensor.
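The linear communication model in the performance-modeling passage, cost = α + βn, can be written down directly. The sketch below is a minimal illustration: the function names and the numeric values of α and β are hypothetical placeholders, not figures from the paper, and n is taken to be the message size in bytes (the excerpt does not state the unit).

```python
def message_cost(n, alpha, beta):
    """Point-to-point cost under the linear model: alpha + beta * n.

    alpha: latency (seconds); beta: inverse bandwidth (seconds per byte,
    assuming n is in bytes); n: message size.
    """
    return alpha + beta * n

def crossover_size(alpha, beta):
    """Message size at which the latency and bandwidth terms are equal."""
    return alpha / beta

if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    alpha = 1e-6   # 1 microsecond latency
    beta = 1e-10   # inverse bandwidth, seconds per byte
    for n in (1_000, 100_000, 10_000_000):
        print(n, message_cost(n, alpha, beta))
    print("crossover:", crossover_size(alpha, beta))
```

Messages much smaller than the crossover size are dominated by the latency term, and much larger ones by the bandwidth term, which is why the model is useful for reasoning about how finely to partition communication.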
