2020
DOI: 10.1137/19m1289546

Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores

Abstract: Computing units that carry out a fused multiply-add (FMA) operation with matrix arguments, referred to as tensor units by some vendors, have great potential for use in scientific computing. However, these units are inherently mixed precision, and existing rounding error analyses do not support them. We consider a mixed precision block FMA that generalizes both the usual scalar FMA and existing tensor units. We describe how to exploit such a block FMA in the numerical linear algebra kernels of matrix multiplica…
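To make the block FMA concrete, here is a minimal numpy sketch of the operation the abstract describes, D = C + A·B with low-precision matrix arguments and higher-precision internal accumulation. The float16/float32 pairing matches a common tensor-core configuration; the name block_fma and its interface are illustrative, not anything defined in the paper.

```python
import numpy as np

def block_fma(A, B, C, acc_dtype=np.float32):
    """Sketch of a mixed precision block FMA: D = C + A @ B.

    A, B (and possibly C) are stored in a low precision such as float16,
    while every product and partial sum is formed in acc_dtype, emulating
    higher-precision internal accumulation."""
    return C.astype(acc_dtype) + A.astype(acc_dtype) @ B.astype(acc_dtype)

# Example: 4x4 half-precision blocks with single-precision accumulation.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float16)
D = block_fma(A, B, C)
print(D.dtype)  # float32: the result stays in the accumulator precision
```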

Cited by 41 publications (46 citation statements) | References 18 publications

“…orders of magnitude smaller than the pure FP16 factorization [30], while retaining the high performance of the FP16 variant.…”
Section: (A) Auto-adaptive Rounding (mentioning)
confidence: 99%
“…Hardware that employs low-precision matrix multiplication with accumulation at high precision, such as the NVIDIA tensor cores, requires a careful analysis that takes account of the internal precisions and the matrix size. A general such analysis, which quantifies the gain from the use of higher precision accumulation, is given in Blanchard et al (2020). A second question concerns the interaction of precision and dimension: an error bound with a constant nu (say) provides no information if nu > 1, such as when n = 10^4 and u is the unit roundoff for half precision (see Table 1).…”
Section: Dense Linear Algebra (mentioning)
confidence: 99%
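As a quick check of the "nu > 1" caveat in the quotation above, here is a worked example using the standard unit roundoff of IEEE half precision with round to nearest (u = 2^-11) and the dimension n = 10^4 taken from the quote:

```python
u_fp16 = 2.0**-11   # unit roundoff of IEEE fp16, round to nearest
n = 10**4           # matrix dimension used in the quoted example
print(n * u_fp16)   # ~4.88 > 1, so an error bound with constant n*u is vacuous
```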
“…The main finding is that low-precision operations and usage of tensor cores increase the amount of correct data produced by the GPU, despite increasing the impact of numerical errors due to the use of lower-precision data. In order to quantify the accuracy of tensor cores, Blanchard et al (2020) provide a rounding error analysis of what they call a block fused multiply-add (FMA), a generalization of the multiply-accumulate operation in (1) in which the matrix sizes, the precisions of the arguments, and the internal precision of the accumulator are taken as parameters.…”
Section: Previous Work (mentioning)
confidence: 99%
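To illustrate how a block FMA can be used inside a larger matrix multiplication, as the abstract and the quotation above describe, the sketch below tiles an n-by-n product into b-by-b block FMAs whose inputs are rounded to float16 while the accumulator stays in float32. It is a schematic of the kernel structure only, written against assumed block size and precisions, not the paper's implementation or any tensor-core API.

```python
import numpy as np

def matmul_with_block_fma(A, B, b=4, in_dtype=np.float16, acc_dtype=np.float32):
    """Tile C = A @ B into b x b block FMAs C_ij += A_ik @ B_kj,
    with the inputs of each block FMA rounded to in_dtype and the
    running block of C kept in acc_dtype throughout."""
    n = A.shape[0]
    assert n % b == 0, "for simplicity, n must be a multiple of b"
    C = np.zeros((n, n), dtype=acc_dtype)
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                Ablk = A[i:i+b, k:k+b].astype(in_dtype)   # low-precision inputs
                Bblk = B[k:k+b, j:j+b].astype(in_dtype)
                # products and sums formed in the accumulator precision
                C[i:i+b, j:j+b] += Ablk.astype(acc_dtype) @ Bblk.astype(acc_dtype)
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
rel_err = np.linalg.norm(matmul_with_block_fma(A, B) - A @ B) / np.linalg.norm(A @ B)
print(rel_err)  # dominated by the float16 rounding of the inputs
```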