Proceedings of the International Conference on Supercomputing 2017
DOI: 10.1145/3079079.3079103

Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs

Abstract: This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations, and autotuning to exceed the performance of proprietary vendor libraries. As a case study, we discuss the fundamental matrix operations defined by the Basic Linear Algebra Subprograms (BLAS) standard. This case study is significantly important for a wide range of ap…

Cited by 20 publications (14 citation statements); references 17 publications.
“…The Kokkos Kernels batched BLAS/LAPACK interface provides multiple functor-level interfaces for dense linear algebra (DLA), which is suitable for Kokkos hierarchical parallelism. Unlike other batched BLAS and LAPACK interfaces, such as Intel batched GEMM [9], cuBLAS batched GEMM [24], and MAGMA batched GEMM [25], we do not provide a front-level (or subroutine) interface that launches a streaming parallel kernel. Instead, we provide a functor-level interface that can be used in Kokkos parallel execution patterns, e.g., parallel for, parallel reduce, and parallel scan.…”
Section: Parallel Batched BLAS/LAPACK Interfaces
confidence: 99%
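As a rough illustration of the functor-level style this excerpt describes, the following is a minimal C++ sketch, not the citing paper's code; the KokkosBatched header name, template tags, and view layout are assumptions based on the public kokkos-kernels interface, and Kokkos initialization is omitted:

#include <Kokkos_Core.hpp>
#include <KokkosBatched_Gemm_Decl.hpp>

// Sketch: one small GEMM per batch entry, invoked inside a user-chosen
// Kokkos parallel pattern instead of a monolithic front-level routine.
void batched_gemm_sketch(int N, int m) {
  using namespace KokkosBatched;
  Kokkos::View<double***> A("A", N, m, m), B("B", N, m, m), C("C", N, m, m);
  Kokkos::parallel_for("batched_gemm", N, KOKKOS_LAMBDA(const int i) {
    auto a = Kokkos::subview(A, i, Kokkos::ALL(), Kokkos::ALL());
    auto b = Kokkos::subview(B, i, Kokkos::ALL(), Kokkos::ALL());
    auto c = Kokkos::subview(C, i, Kokkos::ALL(), Kokkos::ALL());
    // Functor-level GEMM on this batch entry: C_i = 1.0 * A_i * B_i + 0.0 * C_i
    SerialGemm<Trans::NoTranspose, Trans::NoTranspose,
               Algo::Gemm::Unblocked>::invoke(1.0, a, b, 0.0, c);
  });
}

Because the GEMM is just a functor call, the same body could sit inside parallel reduce or parallel scan, which is the flexibility the excerpt contrasts with subroutine-style batched interfaces.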
“…The input arrays contain n random 32-bit integers, with times averaged over 1000 different arrays. Thrust is obviously the fastest for the device-level sort, but Kokkos Kernels gives an average speedup of 1.3x over Kokkos for 2^16 ≤ n ≤ 2^25.…”
Section: Multi-level Bitonic Sorting
confidence: 99%
“…Blocking factors. The blocking factors have a significant impact on the performance of matrix multiplications and have been carefully analyzed and tuned in many previous works, such as References [9, 40, 47]. The evaluation of the blocking factors depends on the methods of parallelization and the constraints of LDM capacity, and can be used to extract important information such as the on-chip data reuse and the amount of data transfer between the main memory and the LDM.…”
Section: Algorithm Analysis
confidence: 99%
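To make the role of blocking factors concrete, here is a schematic C++ loop nest (a sketch of generic cache blocking, not the Sunway/LDM implementation the excerpt analyzes); bm, bn, and bk are the tunable blocking factors that bound the working set held in fast memory:

// Schematic blocked GEMM: C (MxN) += A (MxK) * B (KxN), row-major.
// The blocking factors bm, bn, bk bound the tile of A, B, and C kept in
// fast memory (cache, or LDM on Sunway-style chips) and control how often
// each element is reloaded from main memory.
void blocked_gemm(int M, int N, int K,
                  const double* A, const double* B, double* C,
                  int bm, int bn, int bk) {
  for (int i0 = 0; i0 < M; i0 += bm)
    for (int j0 = 0; j0 < N; j0 += bn)
      for (int k0 = 0; k0 < K; k0 += bk)
        // One (bm x bk) block of A and one (bk x bn) block of B are
        // reused across this entire tile update of C.
        for (int i = i0; i < i0 + bm && i < M; ++i)
          for (int j = j0; j < j0 + bn && j < N; ++j) {
            double acc = C[i * N + j];
            for (int k = k0; k < k0 + bk && k < K; ++k)
              acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
          }
}

Larger bm/bn/bk increase on-chip reuse but must fit the fast-memory capacity, which is exactly the trade-off the excerpt says must be evaluated per parallelization method.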
“…As an important extension of the traditional Basic Linear Algebra Subprograms (BLAS) library [20], the new BLAS proposal has already suggested batched matrix multiplications as an important complement [18]. Recently, vendor-provided libraries such as Intel MKL [1] and NVIDIA cuBLAS [2], and academic research programs such as MAGMA [9], have all added support for this subroutine. Therefore, it is of great importance to study performance optimization methods for batched matrix multiplications on state-of-the-art hardware platforms.…”
Section: Introduction
confidence: 99%
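For context on the vendor-library route mentioned here, a fixed-size batched multiplication through cuBLAS typically looks like the sketch below; cublasDgemmBatched is a real cuBLAS routine, but the surrounding function, its name, and the pointer-array setup (omitted) are illustrative assumptions:

#include <cublas_v2.h>

// Sketch: N independent m x m GEMMs in a single library call.
// d_Aarray/d_Barray/d_Carray are device arrays of device pointers,
// one pointer per matrix in the batch (allocation and error
// handling omitted for brevity).
void vendor_batched_gemm(cublasHandle_t handle, int N, int m,
                         const double* const* d_Aarray,
                         const double* const* d_Barray,
                         double* const* d_Carray) {
  const double alpha = 1.0, beta = 0.0;
  // All batch entries share the same dimensions m x m x m.
  cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     m, m, m, &alpha,
                     d_Aarray, m, d_Barray, m, &beta,
                     d_Carray, m, N);
}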
“…Batched routines are typically classified into two subsets: those where all data entities have the same size, and those where the data entities can differ in size (within a range). The latter type of batched routines, usually referred to as "variable-size," are more complicated in design, but offer higher flexibility in terms of target applications [2].…”
Section: Batched Routines
confidence: 99%
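To show what distinguishes the variable-size case, here is a hypothetical C++ declaration of a "vbatched" GEMM interface; it is not any specific library's API (though MAGMA exposes vbatched routines of this general flavor), and every name in it is illustrative:

// Hypothetical variable-size ("vbatched") GEMM signature: each problem i
// in the batch carries its own dimensions m[i], n[i], k[i] and leading
// dimensions lda[i]/ldb[i]/ldc[i]. These per-problem sizes are what make
// kernel design and load balancing harder than in the fixed-size case,
// where a single (m, n, k) is shared by the whole batch.
void gemm_vbatched(int batchCount,
                   const int* m, const int* n, const int* k,
                   const double* alpha,
                   const double* const* Aarray, const int* lda,
                   const double* const* Barray, const int* ldb,
                   const double* beta,
                   double* const* Carray, const int* ldc);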