2020
DOI: 10.1098/rspa.2020.0110

Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems

Abstract: Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in …
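The core idea the abstract points to can be illustrated outside the GPU setting. Below is a minimal NumPy/SciPy sketch of two-precision iterative refinement, not the authors' tensor-core implementation: the matrix is factorized once in single precision (standing in for the FP16 tensor-core factorization), and the solution is then refined with residuals computed in double precision. The function name mixed_precision_refine and the test problem are illustrative.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_refine(A, b, max_iters=20, tol=1e-12):
    """Two-precision iterative refinement: low-precision LU, FP64 residuals."""
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)

    # Factorize once in low precision -- the O(n^3) step that the paper
    # accelerates with FP16 tensor cores (float32 stands in for FP16 here).
    lu, piv = lu_factor(A64.astype(np.float32))

    # Initial solve with the low-precision factors, promoted to FP64.
    x = lu_solve((lu, piv), b64.astype(np.float32)).astype(np.float64)

    for _ in range(max_iters):
        r = b64 - A64 @ x                      # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        # Correction solve reuses the same low-precision factors.
        x += lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
    return x

# Illustrative use on a small, well-conditioned system.
rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant
b = rng.standard_normal(n)
x = mixed_precision_refine(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # near FP64 accuracy
```

The refinement loop recovers double-precision accuracy as long as the low-precision factorization is accurate enough for the corrections to converge.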

Cited by 46 publications (32 citation statements). References 38 publications.
“…The work illustrates that mixed-precision techniques can be of great interest for linear solvers in many engineering areas. The results show that on a single NVIDIA V100 GPU, the new solvers can be up to 4× faster than an optimized double-precision solver (Haidar et al., 2017, 2018a, 2018b, 2020).…”
Section: Dense Linear Algebra
confidence: 99%
“…The answers to these questions are of wide interest because these accelerators, despite being introduced to accelerate the training of deep neural networks (NVIDIA, 2017, p. 12), are increasingly being used in general-purpose scientific computing, where their fast low precision arithmetic can be exploited in mixed-precision algorithms (Abdelfattah et al., 2020), for example in iterative refinement for linear systems (Haidar et al., 2018a, 2018b, 2020).…”
Section: Year Of Release
confidence: 99%
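The quoted passage alludes to the fast low-precision arithmetic of these accelerators; on V100-class GPUs this means tensor cores, which take FP16 inputs and accumulate the products in FP32. The snippet below only emulates that compute mode in NumPy (no tensor cores are involved) to give a feel for the accuracy it delivers relative to an FP64 reference; the data and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512
A64 = rng.standard_normal((n, n))
B64 = rng.standard_normal((n, n))

# Tensor cores take FP16 inputs, so first round the operands to half precision.
A16, B16 = A64.astype(np.float16), B64.astype(np.float16)

# Emulate the tensor-core compute mode: FP16 inputs, FP32 accumulation.
C_tc = A16.astype(np.float32) @ B16.astype(np.float32)

C_ref = A64 @ B64                                  # FP64 reference
rel_err = np.linalg.norm(C_tc - C_ref) / np.linalg.norm(C_ref)
print(f"relative error of emulated tensor-core product: {rel_err:.2e}")
# The error sits near the FP16 unit roundoff (~1e-3): cheap enough to build a
# factorization from, but far from FP64 accuracy -- hence iterative refinement.
```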
“…A decade after the two-precision iterative refinement work by Buttari et al., Carson and Higham introduced a GMRES-based iterative refinement algorithm that uses up to three precisions for the solution of linear systems (Carson & Higham, 2017; Carson & Higham, 2018). This algorithm enabled Haidar et al. (Haidar et al., 2018a; Haidar et al., 2020; Haidar et al., 2018b) to successfully exploit the half-precision floating-point arithmetic units of NVIDIA tensor cores in the solution of linear systems. Compared with linear solvers using exclusively double precision, their implementation shows up to a 4×–5× speedup while still delivering double-precision accuracy (Haidar et al., 2020; Haidar et al., 2018b).…”
Section: Introduction
confidence: 99%
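The three-precision structure described in the quoted passage can be sketched as follows, again in NumPy/SciPy rather than the MAGMA or cuSOLVER code: float32 stands in for the FP16 factorization precision, float64 is the working precision, and the separate higher residual precision of Carson and Higham's algorithm is collapsed into float64 for simplicity. Each refinement step solves the correction equation A d = r with GMRES preconditioned by the low-precision LU factors; the name gmres_ir is illustrative.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_ir(A, b, max_iters=10, tol=1e-12):
    """GMRES-based iterative refinement with a low-precision LU preconditioner."""
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    n = A64.shape[0]

    # Low-precision LU factorization (float32 standing in for FP16).
    lu, piv = lu_factor(A64.astype(np.float32))

    # The factors, applied in float64, serve as the preconditioner for GMRES.
    def apply_factors(v):
        return lu_solve((lu, piv), v.astype(np.float32)).astype(np.float64)
    M = LinearOperator((n, n), matvec=apply_factors)

    x = apply_factors(b64)                 # initial solution from the factors
    for _ in range(max_iters):
        r = b64 - A64 @ x                  # residual in the working precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        # Correction equation A d = r, solved by preconditioned GMRES.
        d, _ = gmres(A64, r, M=M, maxiter=50)
        x += d
    return x

# Illustrative use on a small, well-conditioned system.
rng = np.random.default_rng(2)
n = 300
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)
x = gmres_ir(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

Reusing the low-precision factors as a preconditioner is the key design point: GMRES then converges in a handful of iterations per refinement step even when the factorization alone is too inaccurate to solve the system directly.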
“…This algorithm enabled Haidar et al. (Haidar et al., 2018a; Haidar et al., 2020; Haidar et al., 2018b) to successfully exploit the half-precision floating-point arithmetic units of NVIDIA tensor cores in the solution of linear systems. Compared with linear solvers using exclusively double precision, their implementation shows up to a 4×–5× speedup while still delivering double-precision accuracy (Haidar et al., 2020; Haidar et al., 2018b). This algorithm is now implemented in the MAGMA library (Agullo et al., 2009; Magma, 2021) and in cuSOLVER, the NVIDIA library that provides LAPACK-like routines.…”
Section: Introduction
confidence: 99%