2015
DOI: 10.1016/j.jpdc.2014.09.003

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

Abstract: We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be easily extended to larger sizes. For single precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multipl…
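
For context, a minimal sketch of the batched cuBLAS pointer-to-pointer interface the paper benchmarks against. It assumes the per-matrix device pointer arrays d_A, d_B, and d_C (names chosen here for illustration) have already been allocated and populated; error handling is omitted.

```c
#include <cublas_v2.h>

// Multiply `batch` independent n x n single-precision matrix pairs:
// C[i] = alpha * A[i] * B[i] + beta * C[i] for i = 0 .. batch-1.
// d_A, d_B, d_C are device arrays of device pointers, one per matrix.
void small_gemm_batch(cublasHandle_t handle, int n, int batch,
                      const float *const *d_A, const float *const *d_B,
                      float *const *d_C)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n,
                       &alpha,
                       d_A, n,   // leading dimension n: dense n x n storage
                       d_B, n,
                       &beta,
                       d_C, n,
                       batch);
}
```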

Cited by 21 publications (23 citation statements)
References 8 publications
“…There are three typical data storage formats for matrix multiplications: the P2P format, the strided format, and the interleaved format [19,24,30]. The P2P format uses arrays whose elements are pointers to memory locations containing matrices, and the pointer arrays are passed as kernel parameters.…”
Section: Data Storage Format (mentioning)
confidence: 99%
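
As an illustration of the three layouts, the declarations below sketch how a batch of M x N single-precision matrices might be stored under each format. The names, sizes, and the interleaved indexing convention are assumptions made for this example, not taken from the cited papers.

```c
#define BATCH 1000
#define M 16
#define N 16

// P2P format: an array of pointers, one per matrix; the pointer array
// itself is passed as a kernel parameter.
float *A_p2p[BATCH];              // A_p2p[i] -> the i-th M x N matrix

// Strided format: one contiguous buffer, matrices a fixed stride apart.
float A_strided[BATCH * M * N];   // matrix i starts at &A_strided[i * M * N]

// Interleaved format: the same element of every matrix is stored
// adjacently, so consecutive threads handling consecutive matrices
// access consecutive addresses.
float A_inter[M * N * BATCH];     // element (r,c) of matrix i is
                                  // A_inter[(r * N + c) * BATCH + i]
```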
“…For example, Dongarra et al. [19] compared different interfaces and data storage formats for batched matrix multiplications, which were also discussed by us in Section 4.4. Jhurani et al. [30] proposed an interface by adding a second leading dimension and implemented batched square matrix multiplications of sizes under 16 on GPUs. In this work, we have applied a similar interface in our implementation with the strided data storage format.…”
Section: Related Work (mentioning)
confidence: 99%
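
A hypothetical sketch of the semantics such an interface with a second leading dimension could have; the function and parameter names are illustrative, not the actual API of [30]. Within each column-major matrix, columns are lda apart, while consecutive matrices in the batch are lda2 apart (likewise for B and C), so lda2 = lda * k recovers the plain strided format and other values allow padding between matrices.

```c
// Reference (CPU) semantics of a batched GEMM with a second leading
// dimension: C[i] = alpha * A[i] * B[i] + beta * C[i].
void gemm_batched_ld2(int m, int n, int k, float alpha,
                      const float *A, int lda, int lda2,
                      const float *B, int ldb, int ldb2,
                      float beta,
                      float *C, int ldc, int ldc2,
                      int batchCount)
{
    for (int i = 0; i < batchCount; ++i) {
        const float *Ai = A + (long)i * lda2;   // matrix i of A
        const float *Bi = B + (long)i * ldb2;   // matrix i of B
        float *Ci = C + (long)i * ldc2;         // matrix i of C
        for (int c = 0; c < n; ++c)
            for (int r = 0; r < m; ++r) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += Ai[r + p * lda] * Bi[p + c * ldb];
                Ci[r + c * ldc] = alpha * acc + beta * Ci[r + c * ldc];
            }
    }
}
```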
“…Chittampally Vasanth Raja, Srinivas Balasubramanian and Prakash S. Raghavendra made a full study of a heterogeneous, highly parallel implementation of matrix exponentiation using GPUs [1]. In [2], the authors provided an interface and a General Matrix Multiply routine implementation for multiplying small matrices processed simultaneously on GPUs. Their matrix size was less than 16.…”
Section: Related Work (mentioning)
confidence: 99%
“…A GPU uses shared memory to reduce memory latency. In a CPU, workloads do not require a lot of memory access and data is brought in as needed, whereas in a GPU there is a lot of memory access and the memory bandwidth is very well developed [2]. A GPU is a pile of parallel co-processors.…”
Section: Introduction (mentioning)
confidence: 99%
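
A minimal illustrative CUDA kernel (not from the cited papers) showing the shared-memory pattern described above: each block stages a tile of global memory into low-latency on-chip shared memory before reusing it.

```c
// Launch with blockDim.x == 256, gridDim.x == ceil(n / 256.0).
__global__ void scale_via_shared(const float *in, float *out, int n, float s)
{
    __shared__ float tile[256];           // per-block shared-memory tile
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];        // coalesced global -> shared copy
    __syncthreads();                      // tile now visible to the whole block
    if (i < n)
        out[i] = s * tile[threadIdx.x];   // subsequent reads hit shared memory
}
```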