2018
DOI: 10.1137/17m1140819
Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions

Abstract: We propose a general algorithm for solving an n × n nonsingular linear system Ax = b based on iterative refinement with three precisions. The working precision is combined with possibly different precisions for solving for the correction term and for computing the residuals. Via rounding error analysis of the algorithm we derive sufficient conditions for convergence and bounds for the attainable forward error and normwise and componentwise backward errors. Our results generalize and unify many ex…

Cited by 147 publications (173 citation statements)
References 32 publications
“…Replacing the direct triangular solves of the correction equation with an iterative method, as suggested in [4] in a mixed precision context, leads to "nesting" of two iterative methods, in general called "inner-outer" iterations, which have been studied both theoretically and computationally [9], [21], [23], including in mixed-precision computation scenarios [2]. Recently, Carson and Higham [4], [5] analyzed the convergence of a three-precision iterative refinement scheme (factorization precision, working precision, residual precision) and concluded that if the condition number of A is not too large, κ∞(A) = ‖A‖∞‖A⁻¹‖∞ < 10⁴, then using FP16 for the O(n³) portion (the LU factorization) and (FP32, FP64) or (FP64, FP128) as the (working, residual) precisions for the O(n²) portion (the refinement loop), one can expect forward and backward errors on the order of 10⁻⁸ and 10⁻¹⁶, respectively. We note that if x̂ is the computed solution of Ax = b, the forward error is defined as ‖x̂ − x‖∞/‖x‖∞ and the backward error as ‖r‖₂/(‖A‖₂‖x̂‖₂), where r = b − Ax̂.…”
Section: Related Work
confidence: 99%
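The three-precision scheme described in that statement can be sketched in NumPy. This is a minimal illustration under simplifying assumptions, not the authors' implementation: no pivoting, a fixed iteration count rather than a convergence test, and the names `lu_factor`, `lu_solve`, and `ir3` are hypothetical.

```python
import numpy as np

def lu_factor(A):
    """Naive in-place LU factorization without pivoting (sketch only;
    real codes use partial pivoting)."""
    n = A.shape[0]
    LU = A.copy()
    for k in range(n - 1):
        LU[k + 1:, k] /= LU[k, k]
        LU[k + 1:, k + 1:] -= np.outer(LU[k + 1:, k], LU[k, k + 1:])
    return LU

def lu_solve(LU, b):
    """Forward substitution (unit lower) then back substitution."""
    n = LU.shape[0]
    x = b.copy()
    for i in range(n):
        x[i] -= LU[i, :i] @ x[:i]
    for i in range(n - 1, -1, -1):
        x[i] = (x[i] - LU[i, i + 1:] @ x[i + 1:]) / LU[i, i]
    return x

def ir3(A, b, n_iter=10):
    """Iterative refinement in three precisions:
    factorization in FP16, working precision FP32, residuals in FP64."""
    LU16 = lu_factor(A.astype(np.float16))    # O(n^3) work in low precision
    LU32 = LU16.astype(np.float32)            # apply factors in working precision
    x = lu_solve(LU32, b.astype(np.float32))
    A64, b64 = A.astype(np.float64), b.astype(np.float64)
    for _ in range(n_iter):
        r = b64 - A64 @ x.astype(np.float64)          # residual in high precision
        d = lu_solve(LU32, r.astype(np.float32))      # correction via fp16 factors
        x = (x + d).astype(np.float32)                # update in working precision
    return x
```

For a well-conditioned system the low-precision factorization error is corrected by the refinement loop, and the final accuracy is limited by the working precision rather than FP16.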
“…The convergence tolerance is chosen on the order of the unit roundoff of the low-precision arithmetic used during the factorization (e.g., we use 10⁻⁴ or 10⁻⁸ when the LU factorization is in FP16 or FP32, respectively). Since this paper focuses on practical usage and possible performance gains rather than error analysis, we point the reader to [4], [5] for detailed error analysis of the IR and IRGM techniques.…”
Section: Background
confidence: 99%
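The tolerances 10⁻⁴ and 10⁻⁸ quoted above match the unit roundoff u = ε/2 of the respective IEEE formats, which can be checked directly (a small sketch; the mapping from format to tolerance is as stated in the quoted text):

```python
import numpy as np

# Unit roundoff u = machine epsilon / 2 for each IEEE floating-point format.
# The quoted stopping tolerances (1e-4 for FP16, 1e-8 for FP32) are of the
# same order of magnitude as u for the factorization precision.
for dtype, name in [(np.float16, "fp16"), (np.float32, "fp32"), (np.float64, "fp64")]:
    u = np.finfo(dtype).eps / 2
    print(f"{name}: u = {u:.2e}")
```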
“…Equation (11) gives the preconditioner memory transfers, i.e., the data transfers (from memory) per iteration, where fpxx_i denotes the precision format selected for the ith diagonal block of the preconditioner. The data transfer volume of the block-Jacobi preconditioner thus depends on the format employed to store the block inverse.…”
Section: Energy Model
confidence: 99%
“…To avoid the previous two pitfalls, in our final experiment, we compute the total data transfers of a single iteration of the PCG method with the block-Jacobi preconditioner stored in fp64, fp32, fp16, or adaptive precision, see Equation (11). To obtain an estimated total data transfer volume, we then combine the data transfer volume per iteration with the number of iterations needed to reach convergence in each case, ignoring those cases for which half precision does not converge.…”
Section: Energy Model
confidence: 99%
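The accounting described in these two statements, per-iteration transfer volume as a function of per-block storage format, multiplied by the iteration count to convergence, can be sketched as a simple cost model. This is a hypothetical model for illustration, not Equation (11) from the cited paper: it assumes each n_i × n_i inverse block is streamed from memory once per iteration, and the function names are invented.

```python
# Storage cost in bytes per matrix entry for each precision format.
BYTES = {"fp64": 8, "fp32": 4, "fp16": 2}

def preconditioner_traffic(block_sizes, formats):
    """Bytes read per iteration to apply a block-Jacobi preconditioner whose
    ith diagonal block inverse (size n_i x n_i) is stored in formats[i]."""
    return sum(BYTES[f] * n * n for n, f in zip(block_sizes, formats))

def total_traffic(block_sizes, formats, iterations):
    """Per-iteration volume times the number of iterations to convergence,
    mirroring the experiment that combines the per-iteration model with the
    observed iteration counts."""
    return preconditioner_traffic(block_sizes, formats) * iterations
```

With this model, storing a block in fp16 instead of fp64 cuts that block's per-iteration traffic by 4x, but the saving only pays off overall if the lower precision does not inflate the iteration count too much, which is the trade-off the adaptive-precision variant targets.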