2015
DOI: 10.1016/j.jpdc.2014.09.003

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

Abstract: We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be easily extended to larger sizes. For single precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multipl…
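
For context, a minimal sketch of the batched cuBLAS pointer-to-pointer interface the paper benchmarks against. It assumes the per-matrix device pointer arrays d_A, d_B, and d_C (names chosen here for illustration) have already been allocated and populated; error handling is omitted.

```c
#include <cublas_v2.h>

// Multiply `batch` independent n x n single-precision matrix pairs:
// C[i] = alpha * A[i] * B[i] + beta * C[i] for i = 0 .. batch-1.
// d_A, d_B, d_C are device arrays of device pointers, one per matrix.
void small_gemm_batch(cublasHandle_t handle, int n, int batch,
                      const float *const *d_A, const float *const *d_B,
                      float *const *d_C)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n,
                       &alpha,
                       d_A, n,   // leading dimension n: dense n x n storage
                       d_B, n,
                       &beta,
                       d_C, n,
                       batch);
}
```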

Cited by 21 publications (23 citation statements)
References 8 publications
“…There are three typical data storage formats for matrix multiplications: the P2P format, the strided format, and the interleaved format [19,24,30]. The P2P format uses arrays whose elements are pointers to memory locations containing matrices, and the pointer arrays are passed as kernel parameters.…”
Section: Data Storage Format (mentioning)
confidence: 99%
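
As an illustration of the three layouts, the declarations below sketch how a batch of M x N single-precision matrices might be stored under each format. The names, sizes, and the interleaved indexing convention are assumptions made for this example, not taken from the cited papers.

```c
#define BATCH 1000
#define M 16
#define N 16

// P2P format: an array of pointers, one per matrix; the pointer array
// itself is passed as a kernel parameter.
float *A_p2p[BATCH];              // A_p2p[i] -> the i-th M x N matrix

// Strided format: one contiguous buffer, matrices a fixed stride apart.
float A_strided[BATCH * M * N];   // matrix i starts at &A_strided[i * M * N]

// Interleaved format: the same element of every matrix is stored
// adjacently, so consecutive threads handling consecutive matrices
// access consecutive addresses.
float A_inter[M * N * BATCH];     // element (r,c) of matrix i is
                                  // A_inter[(r * N + c) * BATCH + i]
```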
“…For example, Dongarra et al. [19] compared different interfaces and data storage formats for batched matrix multiplications, which were also discussed by us in Section 4.4. Jhurani et al. [30] proposed an interface by adding a second leading dimension and implemented batched square matrix multiplications of sizes under 16 on GPUs. In this work, we have applied a similar interface in our implementation with the strided data storage format.…”
Section: Related Work (mentioning)
confidence: 99%
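
A hypothetical sketch of the semantics such an interface with a second leading dimension could have; the function and parameter names are illustrative, not the actual API of [30]. Within each column-major matrix, columns are lda apart, while consecutive matrices in the batch are lda2 apart (likewise for B and C), so lda2 = lda * k recovers the plain strided format and other values allow padding between matrices.

```c
// Reference (CPU) semantics of a batched GEMM with a second leading
// dimension: C[i] = alpha * A[i] * B[i] + beta * C[i].
void gemm_batched_ld2(int m, int n, int k, float alpha,
                      const float *A, int lda, int lda2,
                      const float *B, int ldb, int ldb2,
                      float beta,
                      float *C, int ldc, int ldc2,
                      int batchCount)
{
    for (int i = 0; i < batchCount; ++i) {
        const float *Ai = A + (long)i * lda2;   // matrix i of A
        const float *Bi = B + (long)i * ldb2;   // matrix i of B
        float *Ci = C + (long)i * ldc2;         // matrix i of C
        for (int c = 0; c < n; ++c)
            for (int r = 0; r < m; ++r) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += Ai[r + p * lda] * Bi[p + c * ldb];
                Ci[r + c * ldc] = alpha * acc + beta * Ci[r + c * ldc];
            }
    }
}
```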
“…Chittampally Vasanth Raja, Srinivas Balasubramanian and Prakash S. Raghavendra made a full study of a heterogeneous, highly parallel implementation of matrix exponentiation using GPUs [1]. In [2], the authors provided an interface and a General Matrix Multiply routine implementation for multiplying small matrices processed simultaneously on GPUs. Their matrix size was less than 16.…”
Section: Related Work (mentioning)
confidence: 99%
“…A GPU uses shared memory to reduce memory latency. In a CPU, workloads do not require a lot of memory access and data is brought in as needed, whereas in a GPU there is a lot of memory access and the memory bandwidth is very well developed [2]. A GPU is a pile of parallel co-processors.…”
Section: Introduction (mentioning)
confidence: 99%
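
A minimal illustrative CUDA kernel (not from the cited papers) showing the shared-memory pattern described above: each block stages a tile of global memory into low-latency on-chip shared memory before reusing it.

```c
// Launch with blockDim.x == 256, gridDim.x == ceil(n / 256.0).
__global__ void scale_via_shared(const float *in, float *out, int n, float s)
{
    __shared__ float tile[256];           // per-block shared-memory tile
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];        // coalesced global -> shared copy
    __syncthreads();                      // tile now visible to the whole block
    if (i < n)
        out[i] = s * tile[threadIdx.x];   // subsequent reads hit shared memory
}
```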