Mathieu Vérité scite author profile

Vérité

et al. 2022

In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we consider a machine model with a fast memory of size $S$ and an unbounded slow memory. In this model, all computations must be performed on operands in fast memory, and the goal is to minimize the amount of communication between slow and fast memories. As the set of computations is fixed by the choice of the algorithm, only the ordering of the computations (the schedule) directly influences the volume of communications. We prove lower bounds of $\frac{1}{3\sqrt{2}} \frac{N^3}{\sqrt{S}}$ for the communication volume of the Cholesky factorization of an $N \times N$ symmetric positive definite matrix, and of $\frac{1}{\sqrt{2}} \frac{N^2 M}{\sqrt{S}}$ for the SYRK computation of $A \cdot A^T$, where $A$ is an $N \times M$ matrix. Both bounds improve the best known lower bounds from the literature by a factor $\sqrt{2}$. In addition, we present two out-of-core, sequential algorithms with matching communication volume: TBS for SYRK, with a volume of $\frac{1}{\sqrt{2}} \frac{N^2 M}{\sqrt{S}} + O(NM \log N)$, and LBC for Cholesky, with a volume of $\frac{1}{3\sqrt{2}} \frac{N^3}{\sqrt{S}} + O(N^{5/2})$. Both algorithms improve over the best known algorithms from the literature by a factor $\sqrt{2}$, and prove that the leading terms in our lower bounds cannot be improved further. This work shows that the operational intensity of symmetric kernels like SYRK or Cholesky is intrinsically higher (by a factor $\sqrt{2}$) than that of corresponding non-symmetric kernels (GEMM and LU factorization).
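For readers unfamiliar with the "classical three nested loops" formulation mentioned in this abstract, the following is a minimal Python sketch of both kernels. It is a textbook illustration only, not the TBS or LBC schedules from the paper; the function names `syrk_lower` and `cholesky_lower` are hypothetical, and matrices are plain lists of lists for self-containment.

```python
import math

def syrk_lower(A):
    """Lower triangle of C = A * A^T via three nested loops.

    Symmetry means only entries with j <= i need to be computed.
    """
    n, m = len(A), len(A[0])
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            for k in range(m):
                C[i][j] += A[i][k] * A[j][k]
    return C

def cholesky_lower(A):
    """Lower-triangular L with A = L * L^T, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, n):
            s = A[i][j]
            for k in range(j):
                s -= L[i][k] * L[j][k]
            if i == j:
                L[j][j] = math.sqrt(s)   # diagonal entry
            else:
                L[i][j] = s / L[j][j]    # entry below the diagonal
    return L
```

The set of elementary operations in these loops is fixed; the abstract's question is in which order to perform them so that as few operands as possible move between slow and fast memory.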

Data Distribution Schemes for Dense Linear Algebra Factorizations on Any Number of Nodes

Collin

et al. 2023

In this paper, we consider the problem of distributing the tiles of a dense matrix onto a set of homogeneous nodes. We consider both the case of non-symmetric (LU) and symmetric (Cholesky) factorizations. The efficiency of the well-known 2D Block-Cyclic (2DBC) distribution degrades significantly if the number of nodes P cannot be written as the product of two close numbers. Similarly, the recently introduced Symmetric Block Cyclic (SBC) distribution is only valid for specific values of P. In both contexts, we propose generalizations of these distributions to adapt them to any number of nodes. We show that this provides improvements to existing schemes (2DBC and SBC) both in theory and in practice, using the flexibility and ease of programming induced by task-based runtime systems like Chameleon and StarPU.
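The sensitivity of 2DBC to the value of P can be illustrated with a short Python sketch. It implements only the standard 2D block-cyclic mapping, not the generalized distributions proposed in the paper. The helper names (`closest_grid`, `owner_2dbc`, `distinct_nodes`) are hypothetical, and counting the distinct nodes that appear in a tile row and a tile column is used here as a rough proxy for communication, which is an assumption of this sketch.

```python
import math

def closest_grid(P):
    """Return (p, q) with p * q == P, p <= q, and p as close to sqrt(P) as possible."""
    p = math.isqrt(P)
    while P % p != 0:
        p -= 1
    return p, P // p

def owner_2dbc(i, j, p, q):
    """Node owning tile (i, j) under 2D block-cyclic on a p x q grid."""
    return (i % p) * q + (j % q)

def distinct_nodes(T, p, q):
    """Number of distinct nodes in one tile row and one tile column of a T x T tile matrix."""
    row = {owner_2dbc(0, j, p, q) for j in range(T)}
    col = {owner_2dbc(i, 0, p, q) for i in range(T)}
    return len(row), len(col)

# P = 16 factors as 4 x 4; P = 17 is prime, so the only grid is 1 x 17.
for P in (16, 17):
    p, q = closest_grid(P)
    print(P, (p, q), distinct_nodes(32, p, q))
```

With 16 nodes every row and every column involves 4 nodes, whereas with 17 nodes each row is spread over all 17, which is the kind of degradation the abstract refers to.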

Symmetric Block-Cyclic Distribution: Fewer Communications Leads to Faster Dense Cholesky Factorization

Duchon

et al. 2022

We consider the distributed Cholesky factorization on homogeneous nodes. Inspired by recent progress on asymptotic lower bounds on the total communication volume required to perform Cholesky factorization, we present an original data distribution, Symmetric Block Cyclic (SBC), designed to take advantage of the symmetry of the matrix. We prove that SBC reduces the overall communication volume between nodes by a factor of square root of 2 compared to the standard 2D block-cyclic distribution. SBC can easily be implemented within the paradigm of task-based runtime systems. Experiments using the Chameleon library over the StarPU runtime system demonstrate that the SBC distribution reduces the communication volume as expected, and also achieves better performance and scalability than the classical 2D block-cyclic allocation scheme in all configurations. We also propose a 2.5D variant of SBC and prove that it further improves the communication and performance benefits.

I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels

Beaumont, Eyraud-Dubois, Vérité

et al. 2022

Preprint

2D Static Resource Allocation for Compressed Linear Algebra and Communication Constraints

Vérité

2020

This paper addresses static resource allocation problems for irregular distributed parallel applications. More precisely, we focus on two classical tiled linear algebra kernels: the Matrix Multiplication and the LU decomposition algorithms on large linear systems. On parallel distributed platforms, data exchanges can dramatically degrade the performance of linear algebra kernels, and compression techniques such as Block Low Rank (BLR) are good candidates both for limiting data storage on each node and data exchanges between nodes. On the other hand, the use of the BLR representation makes the static allocation problem of tiles to nodes more complex. Indeed, the workload associated with each tile depends on its compression factor, which induces a heterogeneous load balancing problem. Solving this load balancing problem optimally might lead to complex allocation schemes, where the tiles allocated to a given node are scattered over the whole matrix. This in turn causes communication complexity problems, since matrix multiplication and LU decompositions heavily rely on broadcasting operations along rows and columns of processors, so that the communication volume is minimized when the number of different nodes on each row and column is minimized. In the fully homogeneous case, 2D Block-Cyclic allocation solves both load balancing and communication minimization issues simultaneously, but it might lead to bad load balancing in the heterogeneous case. Our goal in this paper is to propose data allocation schemes dedicated to the BLR format and to prove that it is possible to obtain good makespan performance when simultaneously balancing the load and minimizing the maximal number of different resources in any row or column.
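The two criteria named in this abstract, load balance and the maximal number of different nodes in any row or column, can be made concrete with a small Python sketch. It only evaluates a given allocation and is not the allocation scheme proposed in the paper. The function name `evaluate_allocation` and the tile weights (diagonal tiles costing 4, compressed off-diagonal tiles costing 1) are hypothetical values chosen for illustration.

```python
from collections import defaultdict

def evaluate_allocation(owner, weight):
    """Return (max node load, max number of distinct nodes in any row or column).

    owner[i][j]  : node that owns tile (i, j)
    weight[i][j] : workload of tile (i, j), e.g. depending on its compression
    """
    load = defaultdict(float)
    for i, row in enumerate(owner):
        for j, node in enumerate(row):
            load[node] += weight[i][j]
    rows = [len(set(row)) for row in owner]
    cols = [len(set(col)) for col in zip(*owner)]
    return max(load.values()), max(rows + cols)

# 4 x 4 tiles on a 2 x 2 grid with the standard 2D block-cyclic mapping.
T, p, q = 4, 2, 2
owner = [[(i % p) * q + (j % q) for j in range(T)] for i in range(T)]
# Hypothetical weights: dense diagonal tiles cost 4, compressed tiles cost 1.
weight = [[4.0 if i == j else 1.0 for j in range(T)] for i in range(T)]

print(evaluate_allocation(owner, weight))
```

In this toy instance each row and column involves only 2 nodes, but the heaviest node carries a load of 10 against an average of 7, which illustrates how block-cyclic allocation can stay communication-friendly while becoming unbalanced once tile workloads differ.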