Automated derivation of parametric data movement lower bounds for affine programs

Matrix factorizations are among the most important building blocks of scientific computing. However, state-of-the-art libraries are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedules, both communicating $N^3/(P\sqrt{M})$ elements per processor, where M is the local memory size. The empirical results match our theoretical analysis: our implementations communicate significantly less than Intel MKL, SLATE, and the asymptotically communication-optimal CANDMC and CAPITAL libraries. Our code outperforms these state-of-the-art libraries in almost all tested scenarios, with matrix sizes ranging from 2,048 to 524,288 on up to 512 CPU nodes of the Piz Daint supercomputer, decreasing the time-to-solution by up to three times. Our code is ScaLAPACK-compatible and available as an open-source library.


“…) derived by Olivry et al [51]. Furthermore, to the best of our knowledge, this is the first parallel bound for this kernel.…”

Section: Cholesky Factorization (supporting)

confidence: 50%

“…| |. This result is more general than, e.g., polyhedral techniques [11,15,51] as it does not require loop nests to be affine. Instead, it solely relies on set algebra and combinatorial methods.…”

Section: Knowing the Number Of Different Values Each Takes We Bound The Number Of Different Access Vectors ( ℎ ) (mentioning)

confidence: 90%

On the parallel I/O optimality of linear algebra kernels

Kwasniewski

Ben-Nun

Ziogas

et al. 2021

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming


“…In this section, we consider the problem of finding a symbolic lower bound on the volume of loads needed to perform an affine computation. We first present the main intuitions behind the partitioning method, which is one of the state-of-the-art techniques to derive a symbolic lower bound [11,20,28]. We then provide two improvements on this method, namely reductions and small dimensions.…”

Section: Lower Bound On Data Movement (mentioning)

confidence: 99%

“…Computing an I/O complexity upper bound for an algorithm is the most reasonable way to assess the tightness of a lower bound. While this computation is usually done by hand using ad hoc techniques specific to each studied algorithm [1,12,23,28,31,36], Fauzia et al [15] proposed a heuristic that directly reasons on the CDAG, which unfortunately does not scale to real programs. Finding an upper bound for a fixed architecture can also be viewed as finding an optimized program transformation that minimizes data movement costs, which also implies being able to evaluate this cost.…”

Section: Related Work (mentioning)

confidence: 99%

“…Design of the first algorithm for computing a symbolic over-approximation of the data movement for a parametric (multi-dimensional) tiled version of an affine code; 2. Design of the first fully automated scheme for expressing as an operations research problem the minimization of this data movement expression; [28] for the derivation of tight I/O complexity lower bounds in the presence of small dimensions; 4. Integration of these techniques into a tool that computes, for a class of affine computations: 1. an arithmetic complexity; 2. proved lower and upper bounds on I/O complexity; 3. a suggested tiled code that minimizes data movement.…”

Section: Introduction (mentioning)

confidence: 99%


IOOpt: automatic derivation of I/O complexity bounds for affine programs

Olivry

Iooss

Tollenaere

et al. 2021

Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation

Self Cite


This work was supported in part by the U.S. National Science Foundation through award 2018016, by MIAI Grenoble Alpes (ANR-19-P3IA-0003), and by the Bpifrance Programme d'Investissements d'Avenir (PIA) as part of the ES3CAP project.

…contraction and convolution kernels. Then we evaluate numerically the tightness of our bound using the convolution layers of Yolo9000 and representative tensor contractions from the TCCG benchmark suite. Finally, we show the pertinence of our I/O complexity model by reporting the running time of the recommended tiled code for the convolution layers of Yolo9000.


Data Distribution Schemes for Dense Linear Algebra Factorizations on Any Number of Nodes

Beaumont

Collin

Eyraud-Dubois

et al. 2023

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)


In this paper, we consider the problem of distributing the tiles of a dense matrix onto a set of homogeneous nodes. We consider both the case of non-symmetric (LU) and symmetric (Cholesky) factorizations. The efficiency of the well-known 2D Block-Cyclic (2DBC) distribution degrades significantly if the number of nodes P cannot be written as the product of two close numbers. Similarly, the recently introduced Symmetric Block Cyclic (SBC) distribution is only valid for specific values of P . In both contexts, we propose generalizations of these distributions to adapt them to any number of nodes. We show that this provides improvements to existing schemes (2DBC and SBC) both in theory and in practice, using the flexibility and ease of programming induced by task-based runtime systems like Chameleon and StarPU.
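The sensitivity of 2DBC to the node count can be illustrated with a minimal sketch. It assumes the textbook form of the distribution, in which tile (i, j) goes to the node at grid position (i mod p, j mod q) of a p × q grid with p · q = P; the function names `closest_grid` and `owner_2dbc` are hypothetical and are not taken from the paper, Chameleon, or StarPU.

```python
from math import isqrt


def closest_grid(num_nodes: int) -> tuple[int, int]:
    """Return (p, q) with p * q == num_nodes, p <= q, and p as large as possible."""
    # Walk down from floor(sqrt(P)) to find the largest divisor not above it.
    for p in range(isqrt(num_nodes), 0, -1):
        if num_nodes % p == 0:
            return p, num_nodes // p
    raise ValueError("num_nodes must be positive")


def owner_2dbc(i: int, j: int, p: int, q: int) -> int:
    """Node index owning tile (i, j) on a p x q grid (textbook 2D block-cyclic rule)."""
    return (i % p) * q + (j % q)


if __name__ == "__main__":
    for nodes in (16, 12, 7):
        p, q = closest_grid(nodes)
        # A prime node count forces a 1 x P grid, the degenerate case for 2DBC.
        print(f"P={nodes}: grid {p} x {q}, tile (5, 9) -> node {owner_2dbc(5, 9, p, q)}")
```

With P = 16 the grid is square, while P = 7 collapses to a 1 × 7 grid, which is the kind of unbalanced shape the abstract says degrades efficiency.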


Automated derivation of parametric data movement lower bounds for affine programs

Cited by 25 publications

References 27 publications
