2018
DOI: 10.1002/cpe.4705

Efficient dense matrix‐vector multiplication on GPU

Abstract: Because dense matrix-vector multiplication (Ax or Aᵀx) is of great importance in scientific computation, this paper investigates how to accelerate it on the graphics processing unit (GPU). We present a warp-based implementation of Ax on the GPU, called GEMV-Adaptive, and a thread-based implementation of Aᵀx on the GPU, called GEMV-T-Adaptive. The proposed GEMV-Adaptive and GEMV-T-Adaptive have the following novelties: (1) an adaptive warp allocation strategy for GEMV-Adaptive is pro…
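The truncated abstract only hints at the kernel design, but a warp-based GEMV for Ax typically assigns one warp per matrix row and reduces the partial sums within the warp. The CUDA sketch below illustrates that general pattern under stated assumptions (row-major storage, single precision, a fixed one-warp-per-row mapping); it is not the paper's GEMV-Adaptive code, and the adaptive warp allocation strategy mentioned in the abstract is not reproduced here.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// One warp computes one row of y = A*x: each lane accumulates a strided
// partial dot product, then the 32 partial sums are combined with warp shuffles.
// Illustrative sketch only; names and launch configuration are assumptions.
__global__ void gemv_warp_per_row(const float *A, const float *x, float *y,
                                  int m, int n) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE; // row index
    int lane    = threadIdx.x % WARP_SIZE;                             // lane within warp
    if (warp_id >= m) return;

    float sum = 0.0f;
    for (int j = lane; j < n; j += WARP_SIZE)          // strided read of row warp_id
        sum += A[(size_t)warp_id * n + j] * x[j];

    for (int off = WARP_SIZE / 2; off > 0; off >>= 1)  // intra-warp reduction
        sum += __shfl_down_sync(0xffffffffu, sum, off);

    if (lane == 0)
        y[warp_id] = sum;
}
```

With 128 threads per block, a launch such as `gemv_warp_per_row<<<(m * WARP_SIZE + 127) / 128, 128>>>(dA, dx, dy, m, n)` gives each row its own warp. Per the abstract, GEMV-Adaptive's contribution lies in allocating warps adaptively rather than fixing this mapping, but the truncated abstract does not spell out the allocation rule.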

Citations: cited by 12 publications (5 citation statements). References: 18 publications.
“…Optimizing Small and Skinny Matmul: Many works have optimized Matrix Multiplication and Matrix-Vector Multiplication computations on small and skinny matrices on GPUs [6,19,39]. He et al. [19] propose an optimal warp allocation strategy for matrix-vector multiplication. KBLAS [6] uses double-buffering to overlap data motion with computation to optimize matrix-vector multiplication.…”
Section: Related Work (mentioning)
confidence: 99%
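The excerpt above attributes a double-buffering scheme to KBLAS [6] for overlapping data motion with computation. As a rough, generic illustration of that idea (not KBLAS's actual implementation), the CUDA sketch below streams row tiles of A through two alternating device buffers on two streams, so the copy of one tile overlaps the GEMV work on the other; the tile size, the simple one-thread-per-row kernel, and all identifiers are assumptions.

```cuda
#include <cuda_runtime.h>

// Simple one-thread-per-row GEMV kernel operating on a tile of rows.
__global__ void gemv_rows(const float *A_tile, const float *x, float *y,
                          int rows, int n, int row_offset) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;   // row within this tile
    if (r >= rows) return;
    float sum = 0.0f;
    for (int j = 0; j < n; ++j)
        sum += A_tile[(size_t)r * n + j] * x[j];
    y[row_offset + r] = sum;                         // write to the global row index
}

// Double-buffered host driver: while stream b copies the next tile into its
// buffer, the other stream is still computing on the previously copied tile.
void gemv_double_buffered(const float *hA /* ideally pinned host memory */,
                          const float *dx, float *dy,
                          int m, int n, int tile_rows) {
    cudaStream_t stream[2];
    float *dA[2];
    size_t tile_bytes = (size_t)tile_rows * n * sizeof(float);
    for (int b = 0; b < 2; ++b) {
        cudaStreamCreate(&stream[b]);
        cudaMalloc((void **)&dA[b], tile_bytes);
    }
    for (int row = 0, t = 0; row < m; row += tile_rows, ++t) {
        int b = t & 1;                               // alternate buffers/streams
        int rows = (row + tile_rows <= m) ? tile_rows : (m - row);
        // In-order execution within a stream makes reuse of buffer b safe:
        // its previous copy and kernel have already been issued on stream b.
        cudaMemcpyAsync(dA[b], hA + (size_t)row * n,
                        (size_t)rows * n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        gemv_rows<<<(rows + 127) / 128, 128, 0, stream[b]>>>(
            dA[b], dx, dy, rows, n, row);
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(stream[b]);
        cudaStreamDestroy(stream[b]);
        cudaFree(dA[b]);
    }
}
```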
“…when p differs, the computational performance differs considerably. The same test was performed on the gemv function, and the results show that the performance of gemv also varies considerably across dimensions [18]. When the Sunway TaihuLight supercomputer was used to perform convolution via the gemm matrix-multiplication method, the computation was found to be inefficient because of the large size difference between the convolution array and the convolution kernel.…”
Section: A Convolution Calculation Optimization (mentioning)
confidence: 99%
“…Processing big data by using GPUs has drawn much attention in recent years. Following the introduction of the compute unified device architecture (CUDA), a programming model that supports the joint CPU/GPU execution of applications, by NVIDIA in 2007 [9], GPUs have become strong competitors as general-purpose parallel programming systems and have been increasingly used as tools for high-performance computation in many fields [10–17].…”
Section: Introduction (mentioning)
confidence: 99%