Proceedings of the International Conference on Supercomputing 2017
DOI: 10.1145/3079079.3079103

Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs

Abstract: This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations, and autotuning to exceed the performance of proprietary vendor libraries. As a case study, we discuss the fundamental matrix operations defined by the Basic Linear Algebra Subprograms (BLAS) standard. This case study is significantly important for a wide range of ap…

Cited by 20 publications (14 citation statements); references 17 publications.
“…The Kokkos Kernels batched BLAS/LAPACK interface provides multiple functor-level interfaces for dense linear algebra (DLA), which is suitable for Kokkos hierarchical parallelism. Unlike other batched BLAS and LAPACK interfaces, such as Intel batched GEMM [9], cuBLAS batched GEMM [24], and MAGMA batched GEMM [25], we do not provide a front-level (or subroutine) interface that launches a streaming parallel kernel. Instead, we provide a functor-level interface that can be used in Kokkos parallel execution patterns, e.g., parallel for, parallel reduce, and parallel scan.…”
Section: Parallel Batched BLAS/LAPACK Interfaces
confidence: 99%
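As a rough illustration of the functor-level style this excerpt describes, the following is a minimal C++ sketch, not the citing paper's code; the KokkosBatched header name, template tags, and view layout are assumptions based on the public kokkos-kernels interface, and Kokkos initialization is omitted:

#include <Kokkos_Core.hpp>
#include <KokkosBatched_Gemm_Decl.hpp>

// Sketch: one small GEMM per batch entry, invoked inside a user-chosen
// Kokkos parallel pattern instead of a monolithic front-level routine.
void batched_gemm_sketch(int N, int m) {
  using namespace KokkosBatched;
  Kokkos::View<double***> A("A", N, m, m), B("B", N, m, m), C("C", N, m, m);
  Kokkos::parallel_for("batched_gemm", N, KOKKOS_LAMBDA(const int i) {
    auto a = Kokkos::subview(A, i, Kokkos::ALL(), Kokkos::ALL());
    auto b = Kokkos::subview(B, i, Kokkos::ALL(), Kokkos::ALL());
    auto c = Kokkos::subview(C, i, Kokkos::ALL(), Kokkos::ALL());
    // Functor-level GEMM on this batch entry: C_i = 1.0 * A_i * B_i + 0.0 * C_i
    SerialGemm<Trans::NoTranspose, Trans::NoTranspose,
               Algo::Gemm::Unblocked>::invoke(1.0, a, b, 0.0, c);
  });
}

Because the GEMM is just a functor call, the same body could sit inside parallel reduce or parallel scan, which is the flexibility the excerpt contrasts with subroutine-style batched interfaces.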
“…The input arrays contain n random 32-bit integers, with times averaged over 1000 different arrays. Thrust is obviously the fastest for the device-level sort, but Kokkos Kernels gives an average speedup of 1.3x over Kokkos for 2^16 ≤ n ≤ 2^25.…”
Section: Multi-level Bitonic Sorting
confidence: 99%
“…Blocking factors. The blocking factors have a significant impact on the performance of matrix multiplications and have been carefully analyzed and tuned in many previous works, such as References [9, 40, 47]. The evaluation of the blocking factors depends on the methods of parallelization and the constraints of LDM capacity, and can be used to extract important information such as the on-chip data reuse and the amount of data transfer between the main memory and the LDM.…”
Section: Algorithm Analysis
confidence: 99%
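To make the role of blocking factors concrete, here is a schematic C++ loop nest (a sketch of generic cache blocking, not the Sunway/LDM implementation the excerpt analyzes); bm, bn, and bk are the tunable blocking factors that bound the working set held in fast memory:

// Schematic blocked GEMM: C (MxN) += A (MxK) * B (KxN), row-major.
// The blocking factors bm, bn, bk bound the tile of A, B, and C kept in
// fast memory (cache, or LDM on Sunway-style chips) and control how often
// each element is reloaded from main memory.
void blocked_gemm(int M, int N, int K,
                  const double* A, const double* B, double* C,
                  int bm, int bn, int bk) {
  for (int i0 = 0; i0 < M; i0 += bm)
    for (int j0 = 0; j0 < N; j0 += bn)
      for (int k0 = 0; k0 < K; k0 += bk)
        // One (bm x bk) block of A and one (bk x bn) block of B are
        // reused across this entire tile update of C.
        for (int i = i0; i < i0 + bm && i < M; ++i)
          for (int j = j0; j < j0 + bn && j < N; ++j) {
            double acc = C[i * N + j];
            for (int k = k0; k < k0 + bk && k < K; ++k)
              acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
          }
}

Larger bm/bn/bk increase on-chip reuse but must fit the fast-memory capacity, which is exactly the trade-off the excerpt says must be evaluated per parallelization method.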
“…As an important extension of the traditional Basic Linear Algebra Subprograms (BLAS) library [20], the new BLAS proposal has already suggested batched matrix multiplications as an important complement [18]. Recently, vendor-provided libraries such as Intel MKL [1] and NVIDIA cuBLAS [2], and academic research programs such as MAGMA [9], have all added support for this subroutine. Therefore, it is of great importance to study performance optimization methods for batched matrix multiplications on state-of-the-art hardware platforms.…”
Section: Introduction
confidence: 99%
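For context on the vendor-library route mentioned here, a fixed-size batched multiplication through cuBLAS typically looks like the sketch below; cublasDgemmBatched is a real cuBLAS routine, but the surrounding function, its name, and the pointer-array setup (omitted) are illustrative assumptions:

#include <cublas_v2.h>

// Sketch: N independent m x m GEMMs in a single library call.
// d_Aarray/d_Barray/d_Carray are device arrays of device pointers,
// one pointer per matrix in the batch (allocation and error
// handling omitted for brevity).
void vendor_batched_gemm(cublasHandle_t handle, int N, int m,
                         const double* const* d_Aarray,
                         const double* const* d_Barray,
                         double* const* d_Carray) {
  const double alpha = 1.0, beta = 0.0;
  // All batch entries share the same dimensions m x m x m.
  cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     m, m, m, &alpha,
                     d_Aarray, m, d_Barray, m, &beta,
                     d_Carray, m, N);
}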
“…Batched routines are typically classified into two subsets: those where all data entities have the same size, and those where the data entities can differ in size (within a range). The latter type of batched routines, usually referred to as "variable-size," are more complicated in design, but offer higher flexibility in terms of target applications [2].…”
Section: Batched Routines
confidence: 99%
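To show what distinguishes the variable-size case, here is a hypothetical C++ declaration of a "vbatched" GEMM interface; it is not any specific library's API (though MAGMA exposes vbatched routines of this general flavor), and every name in it is illustrative:

// Hypothetical variable-size ("vbatched") GEMM signature: each problem i
// in the batch carries its own dimensions m[i], n[i], k[i] and leading
// dimensions lda[i]/ldb[i]/ldc[i]. These per-problem sizes are what make
// kernel design and load balancing harder than in the fixed-size case,
// where a single (m, n, k) is shared by the whole batch.
void gemm_vbatched(int batchCount,
                   const int* m, const int* n, const int* k,
                   const double* alpha,
                   const double* const* Aarray, const int* lda,
                   const double* const* Barray, const int* ldb,
                   const double* beta,
                   double* const* Carray, const int* ldc);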