2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors (ASAP)
DOI: 10.1109/asap.2013.6567600
Correctly rounded architectures for Floating-Point multi-operand addition and dot-product computation

Cited by 14 publications (10 citation statements)
References 24 publications
“…7.3), (Tenca, 2009), and is most likely done in hardware. This is in line with the literature on hardware dot products (Kim & Kim, 2009; Tao et al., 2013; Sohn & Swartzlander, 2016; Kaul et al., 2019), where either sorting or a search for the maximum exponent is performed. Furthermore, this experiment demonstrates that none of the additions are performed before aligning the significands relative to the largest exponent: if evaluated before the arguments are shifted right relative to the largest-magnitude argument's exponent (by having multiple alignment stages), any other sum would evaluate to 2^-24 + 2^-24 = 2^-23, a value that could then be added exactly to the total sum, as the least significant bits would not be lost in the alignment.…”
Section: Accuracy of the Dot Products (supporting)
confidence: 80%
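The rounding effect behind that experiment can be reproduced in software. The sketch below only illustrates the binary32 arithmetic involved, not any of the cited hardware designs; numpy's float32 type stands in for binary32 here.

import numpy as np

one  = np.float32(1.0)
tiny = np.float32(2.0 ** -24)   # half an ulp of 1.0 in binary32

# Adding the small terms to the running total one at a time: each
# 1 + 2^-24 is a tie and rounds back to 1 under round-to-nearest-even.
seq = (one + tiny) + tiny
print(seq == one)                                # True: both small terms are lost

# Pre-adding the small terms: 2^-24 + 2^-24 = 2^-23 is exact, and
# 1 + 2^-23 is representable in binary32, so nothing is lost.
pre = one + (tiny + tiny)
print(pre == np.float32(1.0 + 2.0 ** -23))       # True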
“…We can show that the lack of normalization causes the dot product in tensor cores, and most likely in any other similar architectures in which partial sums are not normalized (Kim & Kim, 2009; Tao et al., 2013; Sohn & Swartzlander, 2016; Kaul et al., 2019), to behave non-monotonically. Let us consider (3) and set all the elements in the first column of B to 2^-24 and then c11 to 1 - 2^-24 and 1 in turn.…”
Section: Monotonicity of Dot Product (mentioning)
confidence: 94%
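A rough Python model of that construction is sketched below. The accumulator width (24 significand bits) and the use of plain truncation during alignment are assumptions made for illustration, not details taken from the cited architectures, and the elements multiplying the first column of B are assumed to be 1, so the four products are each 2^-24. Aligning everything to the largest exponent and never normalizing the partial sums makes the output for c11 = 1 - 2^-24 larger than the output for c11 = 1, i.e. the model is non-monotonic.

import math
import numpy as np

def unnormalized_dot(c, products, sig_bits=24):
    # Toy model: align every (nonnegative) term to the largest exponent,
    # drop bits below sig_bits, add without normalizing partial sums,
    # and round once to binary32 at the end.
    terms = [c] + list(products)
    e_max = max(math.frexp(t)[1] - 1 for t in terms if t != 0.0)
    ulp = 2.0 ** (e_max - sig_bits + 1)            # weight of the last kept bit
    aligned = [math.floor(t / ulp) * ulp for t in terms]
    return np.float32(math.fsum(aligned))          # exact sum, single final rounding

products = [2.0 ** -24] * 4                        # the four products from B's first column

small_c = unnormalized_dot(1.0 - 2.0 ** -24, products)
large_c = unnormalized_dot(1.0, products)
print(small_c, large_c, small_c > large_c)         # smaller input, larger output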
“…use in interval arithmetic (see, for example, [24], [29]). Evaluating according to (2.1) is expensive, and although proposals have been made for implementing it in hardware (e.g., [5], [33]), it is not, to our knowledge, supported in commercial processors because of the hardware costs. However, manufacturers could implement something between (2.1) and (2.2) by using a little extra precision, perhaps the extended precisions defined in the IEEE standard [22] or the 80-bit registers on Intel processors.…”
Section: Introduction (mentioning)
confidence: 99%
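Equations (2.1) and (2.2) are not reproduced in this excerpt; the sketch below only contrasts, in Python, the three evaluation strategies the sentence alludes to, under the assumption that (2.1) denotes the correctly rounded dot product and (2.2) conventional recursive evaluation in the working precision: recursive binary32 accumulation, accumulation in a wider format (binary64 standing in for "a little extra precision" such as the 80-bit x87 registers), and an exactly computed sum rounded once.

import math
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)
y = rng.standard_normal(10_000).astype(np.float32)

# (a) Conventional recursive evaluation entirely in binary32.
acc32 = np.float32(0.0)
for xi, yi in zip(x, y):
    acc32 += xi * yi

# (b) Products and sum carried in binary64, rounded to binary32 once at the
#     end: a software stand-in for an accumulator with a little extra precision.
acc64 = np.float32(sum(float(xi) * float(yi) for xi, yi in zip(x, y)))

# (c) Reference: binary32 products are exact in binary64 and math.fsum adds
#     them exactly, so a single final rounding gives (up to a rare double
#     rounding in the binary64 -> binary32 conversion) the correctly rounded result.
exact = np.float32(math.fsum(float(xi) * float(yi) for xi, yi in zip(x, y)))

print(acc32, acc64, exact)   # (b) typically agrees with (c); (a) often differs by a few ulps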
“…Fused sums or sums of products have been studied before [5], [6], [7], [8], [9], [10], [11], [1]. However, only [6], [8], [1] are exact. All the others either truncate the term summation or compress the smaller magnitude terms into sticky bits, which leads to inexact results in some cases of cancellation.…”
Section: B. Related Work and Previous Implementations (mentioning)
confidence: 99%
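The cancellation failure mentioned above is easy to reproduce with a generic model (this is not a description of any particular design from [5]-[11]): if terms are aligned to the largest exponent and the low-order bits are simply dropped, then once the large terms cancel, the discarded bits are exactly what should have remained. A sticky bit would record that something was lost but could not restore its value after the cancellation.

import math
import numpy as np

def truncating_fused_sum(terms, sig_bits=24):
    # Toy model: align to the largest exponent, keep sig_bits bits per term,
    # drop the rest (no sticky), add exactly, round once to binary32.
    e_max = max(math.frexp(t)[1] - 1 for t in terms if t != 0.0)
    ulp = 2.0 ** (e_max - sig_bits + 1)
    kept = [math.trunc(t / ulp) * ulp for t in terms]
    return np.float32(math.fsum(kept))

terms = [1.0, -1.0, 2.0 ** -30, 2.0 ** -30]
print(np.float32(math.fsum(terms)))   # 2^-29: the exact fused sum
print(truncating_fused_sum(terms))    # 0.0: the small terms were discarded
                                      # before the cancellation exposed them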