New Parallel (Rank-Revealing) QR Factorization Algorithms

Cunha, Rudnei Dias da; Becker, Dulceneia; Patterson, James C.

doi:10.1007/3-540-45706-2_94

Cited by 13 publications

(19 citation statements)

References 11 publications

“…term suffices to make the flop count (2mn² − 2n³/3)/P plus lower-order terms. This is because the parallel CAQR flop count (Equation (15) in Section 13.1.2) involves an additional (4B²/3 + 3BK/2)mn² log (. .…”

Section: Flops (mentioning)

confidence: 99%

Communication-optimal Parallel and Sequential QR and LU Factorizations

Demmel, Grigori, Hoemmen, et al. 2012. SIAM J. Sci. Comput.

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. Our first algorithm, Tall Skinny QR (TSQR), factors m × n matrices in a one-dimensional (1-D) block cyclic row layout, and is optimized for m ≫ n. Our second algorithm, CAQR (Communication-Avoiding QR), factors general rectangular matrices distributed in a two-dimensional block cyclic layout. It invokes TSQR for each block column factorization. The new algorithms are superior in both theory and practice. We have extended known lower bounds on communication for sequential and parallel matrix multiplication to provide latency lower bounds, and show these bounds apply to the LU and QR decompositions. We not only show that our QR algorithms attain these lower bounds (up to polylogarithmic factors), but that existing LAPACK and ScaLAPACK algorithms perform asymptotically more communication. We also point out recent LU algorithms in the literature that attain at least some of these lower bounds. Both TSQR and CAQR have asymptotically lower latency cost in the parallel case, and asymptotically lower latency and bandwidth costs in the sequential case. In practice, we have implemented parallel TSQR on several machines, with speedups of up to 6.7× on 16 processors of a Pentium III cluster, and up to 4× on 32 processors of a BlueGene/L. We have also implemented sequential TSQR on a laptop for matrices that do not fit in DRAM, so that slow memory is disk. Our out-of-DRAM implementation was as little as 2× slower than the predicted runtime as though DRAM were infinite. We have also modeled the performance of our parallel CAQR algorithm, yielding predicted speedups over ScaLAPACK's PDGEQRF of up to 9.7× on an IBM Power5, up to 22.9× on a model Petascale machine, and up to 5.3× on a model of the Grid.
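The TSQR idea summarized in the abstract above (factor row blocks independently, then combine the small R factors) can be illustrated with a short sketch. This is not the authors' implementation: it runs in a single process with no MPI, uses contiguous row blocks rather than a block cyclic layout, assumes a binary reduction tree, and returns only the R factor. The names `householder_r`, `tsqr_r`, and `num_blocks` are hypothetical and chosen for illustration.

```python
import math


def householder_r(a):
    """Return the n x n upper-triangular R from a Householder QR of a (m x n, m >= n)."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]  # work on a copy so the input is untouched
    for k in range(n):
        # Norm of column k from the diagonal down.
        norm = math.sqrt(sum(r[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            continue  # nothing to eliminate in this column
        # Pick the sign that avoids cancellation in v[0].
        alpha = -norm if r[k][k] >= 0 else norm
        v = [r[i][k] for i in range(k, m)]
        v[0] -= alpha
        vnorm2 = sum(x * x for x in v)
        if vnorm2 == 0.0:
            continue
        # Apply H = I - 2 v v^T / (v^T v) to columns k..n-1.
        for j in range(k, n):
            dot = sum(v[i] * r[k + i][j] for i in range(len(v)))
            scale = 2.0 * dot / vnorm2
            for i in range(len(v)):
                r[k + i][j] -= scale * v[i]
    # Keep the top n rows and clear rounding residue below the diagonal.
    return [[r[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]


def tsqr_r(a, num_blocks):
    """Compute R of a tall-skinny matrix by a binary-tree reduction over row blocks."""
    m, n = len(a), len(a[0])
    rows_per_block = m // num_blocks
    if rows_per_block < n:
        raise ValueError("each row block needs at least n rows")
    # Leaf step: an independent local QR per row block (one per processor in a real run).
    factors = []
    for b in range(num_blocks):
        start = b * rows_per_block
        stop = m if b == num_blocks - 1 else start + rows_per_block
        factors.append(householder_r(a[start:stop]))
    # Reduction step: stack pairs of R factors and factor the 2n x n stack again.
    while len(factors) > 1:
        merged = []
        for i in range(0, len(factors), 2):
            if i + 1 < len(factors):
                merged.append(householder_r(factors[i] + factors[i + 1]))
            else:
                merged.append(factors[i])  # odd block passes through unchanged
        factors = merged
    return factors[0]
```

For a full-rank input, `tsqr_r(a, 4)` should agree, up to the signs of its rows, with `householder_r(a)` applied to the whole matrix. In a distributed setting each level of the tree would correspond to one round of messages, which is where the latency savings described in the abstract come from.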

“…233-236) and [3]); it is also possible to rewrite the Modified Gram-Schmidt algorithm such that numerical linear dependency between the columns may be detected. In our work we have used both this latter approach as well as the PRRQR algorithm [5] with good results. Now suppose that on step no.…”

Section: Ramifications of Replacing QR by RRQR in the Block-Arnoldi A… (mentioning)

confidence: 90%
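The citation statement above mentions rewriting Modified Gram-Schmidt so that numerical linear dependency between columns can be detected. A minimal sketch of one way this might look follows. It is not the cited authors' algorithm: the relative-residual criterion, the default tolerance `tol`, and the function name `mgs_detect_dependency` are assumptions made for illustration.

```python
import math


def mgs_detect_dependency(columns, tol=1e-10):
    """Orthogonalize columns with Modified Gram-Schmidt and flag dependent ones.

    columns: list of equal-length vectors (each a list of floats).
    Returns (q, independent, dependent), where q holds the orthonormal basis
    and the two index lists partition the input column indices.
    """
    q = []            # orthonormal vectors accepted so far
    independent = []  # indices of columns that extended the basis
    dependent = []    # indices flagged as numerically dependent
    for j, col in enumerate(columns):
        v = list(col)
        original_norm = math.sqrt(sum(x * x for x in v))
        # Modified Gram-Schmidt: project out each basis vector from the
        # current, already-updated residual v, one vector at a time.
        for u in q:
            proj = sum(ui * vi for ui, vi in zip(u, v))
            v = [vi - proj * ui for ui, vi in zip(u, v)]
        residual = math.sqrt(sum(x * x for x in v))
        # Assumed criterion: the column is dependent when almost nothing
        # is left of it after removing its components along the basis.
        if original_norm == 0.0 or residual <= tol * original_norm:
            dependent.append(j)
        else:
            q.append([x / residual for x in v])
            independent.append(j)
    return q, independent, dependent
```

With the columns `[1, 0, 0]`, `[0, 1, 0]`, `[1, 1, 0]`, `[0, 0, 2]`, the third column is the sum of the first two, so it is reported as dependent while the other three extend the basis.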

“…The RRQR factorization uses the PRRQR parallel algorithm (see [5]), which is based on the RRQR algorithm found in [8] (pp. 233-236).…”

Section: Parallel Implementation Details (mentioning)

confidence: 99%

“…233-236) and [3]) requires a few more operations than the non-pivoting, standard QR factorization, in case a matrix has full column-rank. Also, the authors have developed a parallel algorithm for the RRQR factorization [5] that provide good scalability with respect to the number of processors. Thus, in the cases where early convergence does not occur, there is no significant loss of parallel efficiency in the use of the dynamic block-GMRES over the standard block-GMRES.…”

Section: Introduction (mentioning)

confidence: 99%

Dynamic block GMRES: an iterative method for block linear systems

Cunha, Becker. 2006. Adv Comput Math

We present variants of the block-GMRES(m) algorithm due to Vital and the block-LGMRES(m,k) algorithm by Baker, Dennis and Jessup, obtained by replacing the standard QR factorization with a rank-revealing QR factorization in the Arnoldi process. The resulting algorithm allows for dynamic block deflation whenever there is a linear dependency between the Krylov vectors or a right-hand side converges. Fortran 90 implementations of the algorithms were tested on a number of test matrices, and the results show that in some cases a substantial reduction of the execution time is obtained. A parallel implementation of our variant of the block-GMRES(m) algorithm, using Fortran 90 and MPI, was also tested on a SunFire 15K parallel computer, showing good parallel efficiency.

“…The introduction of several eliminators in a given column has a long history [9,10,11,12,13,14]. For shared-memory (multi-core) environments, recent work advocates the use of domain trees [15] to expose more parallelism with several eliminators while enforcing some locality within domains.…”

Section: Related Work (mentioning)

confidence: 99%

Implementing a Systolic Algorithm for QR Factorization on Multicore Clusters with PaRSEC

Aupy, Faverge, Robert, et al. 2014. Euro-Par 2013: Parallel Processing Workshops

This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for internode communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph derived from the algorithm, and only requires the user to decide, at a high level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, which shows that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures.
