2017
DOI: 10.1016/j.procs.2017.05.138

The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems

Abstract: A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current …

Cited by: 60 publications (30 citation statements)
References: 11 publications
“…Many HPC applications rely on the solution of several small-size matrix multiplications in parallel [22]. One example is the Nek5000 CFD application that uses small-size matrix multiplies for each spectral element resulting from the semi-spectral discretization [23], [24].…”
Section: B Batched Matrix Multiplications (mentioning)
confidence: 99%
“…There are three typical data storage formats for matrix multiplications: the P2P format, the strided format, and the interleaved format [19,24,30]. The P2P format uses arrays whose elements are pointers to memory locations containing matrices, and the pointer arrays are passed as kernel parameters.…”
Section: Data Storage Format (mentioning)
confidence: 99%
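
To make the three storage formats named in the statement above concrete, here is a minimal sketch of how element (i, j) of matrix b might be addressed in each one. It is an illustration only, not code from the cited papers: the function names are hypothetical, the matrices are assumed square (n x n) and stored row-major, and plain Python lists stand in for device or host buffers. Real batched BLAS libraries typically use column-major storage and raw pointers.

```python
# Hypothetical helpers; square n x n matrices, row-major order (an assumption).

def pack_p2p(mats):
    """P2P: one separate buffer per matrix, plus a list of references to them."""
    return [[x for row in m for x in row] for m in mats]

def pack_strided(mats):
    """Strided: all matrices back to back in one buffer, n*n elements apart."""
    buf = []
    for m in mats:
        for row in m:
            buf.extend(row)
    return buf

def pack_interleaved(mats, n):
    """Interleaved: element (i, j) of every matrix is stored consecutively."""
    batch = len(mats)
    buf = [0.0] * (batch * n * n)
    for b, m in enumerate(mats):
        for i in range(n):
            for j in range(n):
                buf[(i * n + j) * batch + b] = m[i][j]
    return buf

def get_p2p(ptrs, n, b, i, j):
    # Follow the reference for matrix b, then index inside that buffer.
    return ptrs[b][i * n + j]

def get_strided(buf, n, b, i, j):
    # Jump b whole matrices ahead, then index inside the matrix.
    return buf[b * n * n + i * n + j]

def get_interleaved(buf, n, batch, b, i, j):
    # Jump to the group holding element (i, j), then pick matrix b.
    return buf[(i * n + j) * batch + b]
```

In this sketch the P2P format needs one extra indirection per access, the strided format needs only arithmetic on a single base address, and the interleaved format places the same element of neighbouring matrices side by side, which is the property that vectorized kernels can exploit.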
“…In the past few years, the batched matrix multiplications have drawn increasingly more attention in both the industry [1,2] and the academy [8,19,30]. With the rapid development of high-performance computing, many-core-based architectures that rely on many lightweight computing cores and a deep memory hierarchy are becoming an important solution in designing modern supercomputers.…”
Section: Introduction (mentioning)
confidence: 99%
“…This kind of design might be interesting for a GEMV or TRSV type of operation, where the matrix is read only once. Recent studies on optimized batched BLAS kernels designed for multicore architectures have shown promising results over the classical approach of solving one problem per core at a time [15], [16].…”
Section: The Interleaved Data Layout (mentioning)
confidence: 99%
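
As a rough illustration of why an interleaved layout can suit a GEMV-type operation, the sketch below computes y_b = A_b x_b for a whole batch with the batch index as the innermost loop. It is a simplified sketch under stated assumptions, not the kernel from the cited work: the function name `gemv_interleaved` is hypothetical, matrices are assumed square (n x n), and element (i, j) of matrix b is assumed to live at index (i*n + j)*batch + b, with vectors interleaved the same way.

```python
def gemv_interleaved(a, x, n, batch):
    """Batched y_b = A_b @ x_b on interleaved buffers (hypothetical layout).

    a: length n*n*batch, element (i, j) of matrix b at (i*n + j)*batch + b
    x: length n*batch,   element j of vector b at j*batch + b
    returns y: length n*batch, element i of vector b at i*batch + b
    """
    y = [0.0] * (n * batch)
    for i in range(n):
        for j in range(n):
            a_off = (i * n + j) * batch
            x_off = j * batch
            y_off = i * batch
            # Innermost loop walks contiguous memory across the batch;
            # each matrix element is read exactly once.
            for b in range(batch):
                y[y_off + b] += a[a_off + b] * x[x_off + b]
    return y
```

With this loop order the innermost loop touches consecutive entries of `a`, `x`, and `y`, so on real hardware it would be a natural candidate for SIMD vectorization across the batch.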