2021
DOI: 10.7717/peerj-cs.330

Numerical behavior of NVIDIA tensor cores

Abstract: We explore the floating-point arithmetic implemented in the NVIDIA tensor cores, which are hardware accelerators for mixed-precision matrix multiplication available on the Volta, Turing, and Ampere microarchitectures. Using Volta V100, Turing T4, and Ampere A100 graphics cards, we determine what precision is used for the intermediate results, whether subnormal numbers are supported, what rounding mode is used, in which order the operations underlying the matrix multiplication are performed, and whether partial…
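As a hedged illustration of the kind of probe the abstract describes, the sketch below uses the CUDA WMMA API to multiply half-precision 16×16 matrices on the tensor cores with inputs chosen so that the exact value of one dot product, 1 + 2^-24 + 2^-25, is not representable in binary32 and is not a rounding tie. The kernel name, tile shape, and input values are illustrative assumptions, not the authors' actual test harness; the point is only that truncation (RZ) leaves exactly 1 in the output, whereas round to nearest would give 1 + 2^-23.

```cuda
// Minimal WMMA rounding probe (illustrative; needs a GPU with tensor cores,
// compute capability >= 7.0; build with e.g. `nvcc -arch=sm_70 probe.cu`).
#include <cmath>
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp multiplies a 16x16x16 half-precision tile and accumulates into
// binary32 on the tensor cores, so the host can inspect how the hardware
// rounded the result.
__global__ void wmma_probe(const half *a, const half *b, const float *c, float *d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}

int main()
{
    half  ha[256], hb[256];
    float hc[256] = {}, hd[256] = {};
    for (int i = 0; i < 256; ++i) { ha[i] = __float2half(0.0f); hb[i] = __float2half(0.0f); }

    // Exact value of D(0,0) = A(0,0)*B(0,0) + A(0,1)*B(1,0) + C(0,0)
    //                       = 2^-24        + 2^-25         + 1.
    hc[0] = 1.0f;
    ha[0] = __float2half(ldexpf(1.0f, -12));   // A(0,0) = 2^-12
    hb[0] = __float2half(ldexpf(1.0f, -12));   // B(0,0) = 2^-12 (col-major index 0)
    ha[1] = __float2half(ldexpf(1.0f, -12));   // A(0,1) = 2^-12 (row-major index 1)
    hb[1] = __float2half(ldexpf(1.0f, -13));   // B(1,0) = 2^-13 (col-major index 1)

    half *da, *db; float *dc, *dd;
    cudaMalloc(&da, sizeof(ha)); cudaMalloc(&db, sizeof(hb));
    cudaMalloc(&dc, sizeof(hc)); cudaMalloc(&dd, sizeof(hd));
    cudaMemcpy(da, ha, sizeof(ha), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, sizeof(hb), cudaMemcpyHostToDevice);
    cudaMemcpy(dc, hc, sizeof(hc), cudaMemcpyHostToDevice);

    wmma_probe<<<1, 32>>>(da, db, dc, dd);           // WMMA operations are warp-wide
    cudaMemcpy(hd, dd, sizeof(hd), cudaMemcpyDeviceToHost);

    // Truncation (RZ) leaves exactly 1.0; round to nearest would give 1 + 2^-23.
    printf("D(0,0) = %.9e\n", hd[0]);
    return 0;
}
```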

Cited by 28 publications (33 citation statements); references 13 publications.
“…We have also performed experiments using the AMD dual-socket EPYC Naples system with 64 cores and the NVIDIA P100 GPU, and we obtained similar results. We note that the arithmetic properties of the NVIDIA GPUs are investigated in Fasi et al. (2021).…”
Section: Methods
confidence: 99%
“…The accumulator inside Tensor Cores has at least 2 extra bits of mantissa, and RZ is used for rounding [6]. It follows that RZ is applied to the accumulator frag_c in every iteration of the k loop in Code 2.…”
Section: Avoiding RZ During Tensor Core Accumulation
confidence: 99%
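The statement above concerns truncation being applied to the accumulator fragment on every iteration of the k loop. As a rough, self-contained illustration (not the cited Code 2, which is not reproduced here), the sketch below accumulates the same terms once with round-to-nearest additions and once with round-toward-zero additions using the standard CUDA device intrinsics __fadd_rn and __fadd_rz; the RZ sum drifts below the RN sum because every truncation discards low-order bits in the same direction.

```cuda
// Illustrative only: simulates the one-sided bias of per-addition round toward zero.
// Build with e.g. `nvcc rz_bias.cu`.
#include <cmath>
#include <cstdio>

__global__ void accumulate(const float *x, int k, float *sum_rn, float *sum_rz)
{
    float rn = 0.0f, rz = 0.0f;
    for (int i = 0; i < k; ++i) {
        rn = __fadd_rn(rn, x[i]);   // IEEE round to nearest (even)
        rz = __fadd_rz(rz, x[i]);   // round toward zero, i.e. truncation
    }
    *sum_rn = rn;
    *sum_rz = rz;
}

int main()
{
    const int k = 1 << 20;
    float *x, *sum_rn, *sum_rz;
    cudaMallocManaged(&x, k * sizeof(float));
    cudaMallocManaged(&sum_rn, sizeof(float));
    cudaMallocManaged(&sum_rz, sizeof(float));
    for (int i = 0; i < k; ++i) x[i] = 1.0f + ldexpf(1.0f, -20);  // 1 + 2^-20

    accumulate<<<1, 1>>>(x, k, sum_rn, sum_rz);
    cudaDeviceSynchronize();

    // The RZ sum is systematically below the RN sum, and the gap grows with k.
    printf("RN sum = %.6f\nRZ sum = %.6f\n", *sum_rn, *sum_rz);
    return 0;
}
```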
“…Jia et al. and Raihan et al. analyze how Tensor Core assembly instructions divide the input matrices and the order in which they compute multiplications of the subdivided matrices [14,25]. There have also been studies of how Tensor Cores support subnormal numbers and use RZ (round toward zero) [6]. Others have performed error analysis of Tensor Cores, in which the theoretical error bound of mixed-precision block FMA computation is analyzed and compared to the actual error of Tensor Cores [1].…”
Section: Introduction
confidence: 99%
“…Here, data reuse is employed to reduce energy consumption and memory accesses. A numerical study of NVIDIA's tensor cores was carried out: their floating-point operations were examined, their shortcomings were identified, and the non-monotonicity issue affecting floating-point results was explained (Fasi et al., 2021).…”
Section: Introduction
confidence: 99%