The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems

Dongarra, Jack; Hammarling, Sven; Higham, Nicholas J.; Relton, Samuel D.; Valero-Lara, Pedro; Zounon, Mawussi

doi:10.1016/j.procs.2017.05.138

Cited by 60 publications

(30 citation statements)

References 11 publications


“…Many HPC applications rely on the solution of several small-size matrix multiplications in parallel [22]. One example is the Nek5000 CFD application that uses small-size matrix multiplies for each spectral element resulting from the semi-spectral discretization [23], [24].…”

Section: B Batched Matrix Multiplications (mentioning)

confidence: 99%

NVIDIA Tensor Core Programmability, Performance & Precision

Markidis

Chien

Laure

et al. 2018

2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)


The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called Tensor Core, that performs one matrix-multiply-and-accumulate on 4×4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performance and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
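The trade-off in the abstract above, where precision lost to half-precision inputs is recovered by doing more multiplications, can be illustrated with a small sketch. This is not taken from the cited paper: it assumes one possible refinement scheme in which the rounding residual of each input is itself rounded to half precision and multiplied in as a correction term. The function names are hypothetical, and Python's double-precision arithmetic stands in for the higher-precision accumulator.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def dot_half_inputs(a, b):
    """Dot product with half-precision inputs, accumulated in higher precision."""
    return sum(to_half(x) * to_half(y) for x, y in zip(a, b))


def dot_refined(a, b):
    """Same product plus correction terms built from half-precision residuals.

    Assumption: this residual scheme is illustrative only and may differ
    from the method used in the cited work.
    """
    total = 0.0
    for x, y in zip(a, b):
        xh, yh = to_half(x), to_half(y)
        # What rounding to half threw away, itself stored in half precision.
        rx, ry = to_half(x - xh), to_half(y - yh)
        # Three products per element instead of one.
        total += xh * yh + xh * ry + rx * yh
    return total


a = [0.1, 0.2, 0.3]
b = [0.7, 0.8, 0.9]
exact = sum(x * y for x, y in zip(a, b))
err_plain = abs(dot_half_inputs(a, b) - exact)
err_refined = abs(dot_refined(a, b) - exact)
print(err_plain, err_refined)
```

For these inputs the refined error is smaller than the plain one, at the price of three products per element where the plain version needs one.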


“…There are three typical data storage formats for matrix multiplications: the P2P format, the strided format, and the interleaved format [19,24,30]. The P2P format uses arrays whose elements are pointers to memory locations containing matrices, and the pointer arrays are passed as kernel parameters.…”

Section: Data Storage Format (mentioning)

confidence: 99%
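The three storage formats named in the statement above differ mainly in how the address of an element is computed. The sketch below uses hypothetical helper names and assumes square n×n matrices stored column-major. The quoted text only describes the P2P format, so the strided layout (matrices packed back to back at a fixed stride) and the interleaved layout (the same element of every matrix stored contiguously) follow the usual meaning of those terms and are not taken from the citing paper.

```python
def strided_index(b, i, j, n, stride=None):
    """Offset of element (i, j) of matrix b in one contiguous buffer."""
    if stride is None:
        stride = n * n  # assumption: matrices are packed with no padding
    return b * stride + j * n + i


def interleaved_index(b, i, j, n, count):
    """Offset when element (i, j) of all `count` matrices is stored together."""
    return (j * n + i) * count + b


def build_layouts(matrices):
    """Return the same batch in P2P, strided, and interleaved form."""
    count = len(matrices)
    n = len(matrices[0])
    # P2P: one separately allocated buffer per matrix; the outer list plays
    # the role of the pointer array passed to the kernel.
    p2p = [[m[i][j] for j in range(n) for i in range(n)] for m in matrices]
    strided = [0.0] * (count * n * n)
    interleaved = [0.0] * (count * n * n)
    for b, m in enumerate(matrices):
        for i in range(n):
            for j in range(n):
                strided[strided_index(b, i, j, n)] = m[i][j]
                interleaved[interleaved_index(b, i, j, n, count)] = m[i][j]
    return p2p, strided, interleaved


batch = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
p2p, strided, interleaved = build_layouts(batch)
print(p2p)
print(strided)
print(interleaved)
```

In the interleaved buffer, consecutive memory locations hold the same entry of different matrices, which is why that layout is usually associated with vectorizing across the batch.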

“…In the past few years, the batched matrix multiplications have drawn increasingly more attention in both the industry [1,2] and the academy [8,19,30]. With the rapid development of high-performance computing, many-core-based architectures that rely on many lightweight computing cores and a deep memory hierarchy are becoming an important solution in designing modern supercomputers.…”

Section: Introduction (mentioning)

confidence: 99%

“…This trend gives rise to the applications of batched matrix multiplications, which can be frequently found in, e.g., quantum chemistry [13], astrophysics [39], metabolic networks [32], computational fluid dynamics [44], domain decomposition solvers [10], tensor computations [45], and deep learning [6,11]. It has been proven that in these applications, the performance could be greatly improved by exploiting batched computations of small matrix multiplications [8,19,37]. As an important extension of the traditional Basic Linear Algebra Subprograms (BLAS) library [20], the new BLAS proposal has already suggested the batched matrix multiplications as an important complement [18].…”

Section: Introduction (mentioning)

confidence: 99%


Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor

Jiang

Yang

2020

ACM Trans. Archit. Code Optim.


We present a systematic methodology for optimizing batched matrix multiplications on the SW26010 many-core processor of the Sunway TaihuLight supercomputer. Five surrogate algorithms and a machine learning-based algorithm selector are proposed to fully exploit the computing capability of SW26010 and cope with the sophisticated algorithm characteristics of batched matrix multiplications. Experimental results show that the algorithm selector is able to adaptively choose the appropriate algorithm for various matrix shapes and batch sizes with low overhead and high accuracy. In particular, the optimized batched matrix multiplications can substantially outperform the non-batched version and reach around 84.8% of the performance upper bound.


“…This kind of design might be interesting for a GEMV or TRSV type of operation, where the matrix is read only once. Recent studies on optimized batched BLAS kernels designed for multicore architectures have shown promising results over the classical approach of solving one problem per core at a time [15], [16].…”

Section: The Interleaved Data Layout (mentioning)

confidence: 99%

A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations

Haidar

Abdelfattah

Zounon

et al. 2018

IEEE Trans. Parallel Distrib. Syst.

Self Cite


We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups vs. CUBLAS of up to 6× for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU.
