2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/ipdps.2019.00020
Communication-Avoiding Cholesky-QR2 for Rectangular Matrices

Abstract: Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show its effectiveness for a wide range of matrix sizes. Our algorithm executes over a 3D processor grid, the dimensions of which can be tuned to trade off costs in synchronization, interprocessor communication, computational work, and memory footprint. We implement this algorithm…
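As background for the algorithm the paper parallelizes, here is a minimal single-process NumPy sketch of the basic CholeskyQR2 iteration. It shows only the numerical structure, not the paper's 3D-parallel formulation; the function names are ours.

```python
import numpy as np

def cholesky_qr(A):
    """One CholeskyQR pass: A = Q R via the Gram matrix A^T A."""
    G = A.T @ A                      # Gram matrix (the costly reduction in parallel)
    R = np.linalg.cholesky(G).T      # upper-triangular factor, G = R^T R
    Q = np.linalg.solve(R.T, A.T).T  # Q = A R^{-1}; a triangular solve (trsm) in practice
    return Q, R

def cholesky_qr2(A):
    """CholeskyQR2: a second pass reorthogonalizes the first-pass Q."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1                # A = Q (R2 R1)
```

The second pass is what distinguishes CholeskyQR2: a single CholeskyQR pass loses orthogonality in proportion to the squared condition number of A, and repeating the pass on the nearly orthogonal Q1 recovers it.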

Cited by 11 publications (21 citation statements); references 35 publications.
“…Furthermore, to secure high performance, we carefully tune block sizes and communication routines to maximize the efficiency of local computations such as trsm (triangular solve) and gemm (matrix multiplication). We measure both communication volume and achieved performance of COnfLUX and COnfCHOX and compare them to state-of-the-art libraries: a vendor-optimized Intel MKL [34], SLATE [28] (a recent library targeting exascale systems), as well as CANDMC [57,58] and CAPITAL [32,33] (codes based on the asymptotically optimal 2.5D decomposition). In our experiments on the Piz Daint supercomputer, we measure up to a 1.6x communication reduction compared to the second-best implementation.…”
Section: Number of Nodes (mentioning)
confidence: 99%
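To make the role of these local kernels concrete, below is a NumPy/SciPy sketch (our own illustration, not code from the cited work) of a right-looking blocked Cholesky factorization. The block size b is exactly the kind of parameter such libraries tune: the panel step is a trsm, and the trailing update is a gemm-like syrk.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def blocked_cholesky(A, b=64):
    """Right-looking blocked Cholesky of an SPD matrix A; returns lower L.
    Hypothetical sketch: b sets the trsm/gemm granularity."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Factor the diagonal block (LAPACK potrf).
        A[k:e, k:e] = cholesky(A[k:e, k:e], lower=True)
        if e < n:
            # Panel solve against the diagonal block (BLAS trsm).
            A[e:, k:e] = solve_triangular(A[k:e, k:e], A[e:, k:e].T,
                                          lower=True).T
            # Symmetric trailing-matrix update (BLAS syrk, a gemm variant).
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)
```

Larger blocks shift work into the highly efficient gemm-like update; smaller blocks reduce the serial potrf on the critical path, which is why the value is tuned per machine.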
“…Matrix factorizations are included in most linear-solver libraries. With regard to parallelization strategy, these libraries may be categorized into three groups: task-based: SLATE [28] (OpenMP tasks), DLAF [35] (HPX tasks), DPLASMA [12] (DaGuE scheduler), or CHAMELEON [3] (StarPU tasks); static 2D parallel: MKL [34], Elemental [53], or Cray LibSci [16]; communication-minimizing 2.5D parallel: CANDMC [57] and CAPITAL [33]. In the last decade, heavy focus has been placed on heterogeneous architectures.…”
Section: Related Work (mentioning)
confidence: 99%
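As a purely illustrative aside (our assumption, not code from any cited library), the 2.5D decompositions mentioned here arrange p processes as a d x d x c grid, where the replication factor c trades extra memory for reduced communication; a minimal helper computing such a shape:

```python
import math

def grid_shape(p, c):
    """Hypothetical helper: shape a 2.5D processor grid as (d, d, c)
    from p processes and replication factor c, assuming p is divisible
    by c and p/c is a perfect square. Larger c stores c copies of the
    data in exchange for less interprocessor communication."""
    d = math.isqrt(p // c)
    if p % c or d * d * c != p:
        raise ValueError("need p divisible by c with p/c a perfect square")
    return d, d, c

# e.g. grid_shape(512, 2) -> (16, 16, 2); c = 1 recovers a plain 2D grid.
```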