2012
DOI: 10.1007/s00211-012-0496-2
Reorthogonalized block classical Gram–Schmidt

Abstract: A new reorthogonalized block classical Gram–Schmidt algorithm is proposed that factorizes a full-column-rank matrix A into A = QR, where Q is left orthogonal (has orthonormal columns) and R is upper triangular and nonsingular. With appropriate assumptions on the diagonal blocks of R, the algorithm, when implemented in floating-point arithmetic with machine unit ε_M, produces Q and R such that […]. The resulting bounds also improve a previous bound by Giraud et al. [Numer. Math., 101(1):87–100, 2005] on the CGS2 algorithm…
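The factorization the abstract describes can be sketched as a block classical Gram–Schmidt loop with one extra projection pass against the previously computed columns ("twice is enough"). This is a minimal NumPy illustration under assumptions, not the paper's exact BCGS2: the function name, the block partitioning, and the use of `np.linalg.qr` for the intra-block step are choices made here for clarity.

```python
import numpy as np

def bcgs2(A, block_size):
    """Sketch of reorthogonalized block classical Gram-Schmidt.

    Factors A = Q R with Q left orthogonal (orthonormal columns)
    and R upper triangular, assuming A has full column rank.
    """
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        W = A[:, start:stop].copy()
        Qp = Q[:, :start]          # columns orthogonalized so far
        # First block projection against the previous columns.
        S1 = Qp.T @ W
        W -= Qp @ S1
        # Second (reorthogonalization) pass over the same columns.
        S2 = Qp.T @ W
        W -= Qp @ S2
        # Intra-block orthogonalization via an unblocked QR step.
        Qb, Rb = np.linalg.qr(W)
        Q[:, start:stop] = Qb
        R[start:stop, start:stop] = Rb
        # Accumulated coupling block: A_blk = Qp (S1 + S2) + Qb Rb.
        R[:start, start:stop] = S1 + S2
    return Q, R
```

In exact arithmetic the single-pass version already yields A = QR; the second projection is what restores near-machine-precision orthogonality of Q in floating point, which is the point of the paper's analysis.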

Cited by 23 publications (64 citation statements) · References 15 publications
“…The final result B(i, j) is computed by launching another CUDA kernel to perform another binary reduction among the thread blocks. Our implementation is designed to reduce the number of synchronizations among the threads while relying on the CUDA runtime and the parameter tuning to exploit the data locality.…”
Section: Figures 14(a) and 14(b), Then the Final Partial Results B(i, j)
confidence: 99%
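The two-stage scheme this quote describes (each thread block reduces its chunk to a partial result, then a second kernel reduces the partials) can be mimicked in plain Python. This is an illustrative sketch only; the function name, chunking, and pairwise final pass are assumptions, not the cited CUDA implementation.

```python
import math

def two_stage_dot(x, y, num_blocks):
    """Dot product via two reduction stages, mimicking per-thread-block
    partial sums followed by a final binary reduction (sketch only)."""
    n = len(x)
    chunk = math.ceil(n / num_blocks)
    # Stage 1: each "thread block" reduces its chunk to one partial sum.
    partials = [sum(xi * yi for xi, yi in zip(x[i:i + chunk], y[i:i + chunk]))
                for i in range(0, n, chunk)]
    # Stage 2: a second kernel-like pass reduces the partials pairwise
    # until a single value remains.
    while len(partials) > 1:
        partials = [partials[i] + (partials[i + 1] if i + 1 < len(partials) else 0)
                    for i in range(0, len(partials), 2)]
    return partials[0]
```

On a GPU the payoff of this structure is that stage 1 needs no inter-block synchronization at all; only the small stage-2 reduction touches data produced by different blocks.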
“…Previously, the blocked variants of TSQR have been studied [1,2,4]. To generate n + 1 orthonormal basis vectors, our CA-GMRES and CA-Lanczos [25] use block orthogonalization followed by TSQR with a step size of s, where the step size is equivalent to the block size in the blocked algorithm to orthogonalize n + 1 vectors (e.g., n = 60 and s = 15 in our experiments).…”
Section: Error Analysis
confidence: 99%
“…For the purpose of the block reorthogonalization, the block CGS (BCGS) algorithm [27], which is also a variant of the CGS algorithm, has been proposed. To improve the orthogonality of the resulting vectors, the BCGS algorithm with reorthogonalization (BCGS2 algorithm) [28] is preferable. The BCGS2 algorithm, including the QR factorization, can be implemented mainly using matrix multiplications.…”
Section: Block Inverse Iteration Algorithm with Reorthogonalization
confidence: 99%
“…To improve the accuracy of computed factors one can introduce the implementation with iterative refinement, where the Cholesky-like factorization is applied first to […] to obtain R^(1) and Ω^(1). The factor Q^(1) is then obtained as Q^(1) […] to get the factors R^(2) and Ω^(2).…”
Section: Cholesky-like Factorization of Symmetric Indefinite Matrices
confidence: 99%
“…The factor Q^(1) is then obtained as Q^(1) […] to get the factors R^(2) and Ω^(2). The resulting factors are then […].…”
Section: Cholesky-like Factorization of Symmetric Indefinite Matrices
confidence: 99%