2018
DOI: 10.1002/cpe.4705

Efficient dense matrix‐vector multiplication on GPU

Abstract: Because dense matrix-vector multiplication (Ax or Aᵀx) is of great importance in scientific computation, this paper investigates how to accelerate it on the graphics processing unit (GPU). We present a warp-based implementation of Ax on the GPU, called GEMV-Adaptive, and a thread-based implementation of Aᵀx on the GPU, called GEMV-T-Adaptive. The proposed GEMV-Adaptive and GEMV-T-Adaptive have the following novelties: (1) an adaptive warp allocation strategy for GEMV-Adaptive is pro…
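The truncated abstract only hints at the kernel design, but a warp-based GEMV for Ax typically assigns one warp per matrix row and reduces the partial sums within the warp. The CUDA sketch below illustrates that general pattern under stated assumptions (row-major storage, single precision, a fixed one-warp-per-row mapping); it is not the paper's GEMV-Adaptive code, and the adaptive warp allocation strategy mentioned in the abstract is not reproduced here.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// One warp computes one row of y = A*x: each lane accumulates a strided
// partial dot product, then the 32 partial sums are combined with warp shuffles.
// Illustrative sketch only; names and launch configuration are assumptions.
__global__ void gemv_warp_per_row(const float *A, const float *x, float *y,
                                  int m, int n) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE; // row index
    int lane    = threadIdx.x % WARP_SIZE;                             // lane within warp
    if (warp_id >= m) return;

    float sum = 0.0f;
    for (int j = lane; j < n; j += WARP_SIZE)          // strided read of row warp_id
        sum += A[(size_t)warp_id * n + j] * x[j];

    for (int off = WARP_SIZE / 2; off > 0; off >>= 1)  // intra-warp reduction
        sum += __shfl_down_sync(0xffffffffu, sum, off);

    if (lane == 0)
        y[warp_id] = sum;
}
```

With 128 threads per block, a launch such as `gemv_warp_per_row<<<(m * WARP_SIZE + 127) / 128, 128>>>(dA, dx, dy, m, n)` gives each row its own warp. Per the abstract, GEMV-Adaptive's contribution lies in allocating warps adaptively rather than fixing this mapping, but the truncated abstract does not spell out the allocation rule.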

Citations: cited by 12 publications (5 citation statements). References: 18 publications.
“…Optimizing Small and Skinny Matmul: Many works have optimized Matrix Multiplication and Matrix-Vector Multiplication computations on small and skinny matrices on GPUs [6,19,39]. He et al. [19] propose an optimal warp allocation strategy for matrix-vector multiplication. KBLAS [6] uses double-buffering to overlap data motion with computation to optimize matrix-vector multiplication.…”
Section: Related Work (mentioning)
confidence: 99%
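The excerpt above attributes a double-buffering scheme to KBLAS [6] for overlapping data motion with computation. As a rough, generic illustration of that idea (not KBLAS's actual implementation), the CUDA sketch below streams row tiles of A through two alternating device buffers on two streams, so the copy of one tile overlaps the GEMV work on the other; the tile size, the simple one-thread-per-row kernel, and all identifiers are assumptions.

```cuda
#include <cuda_runtime.h>

// Simple one-thread-per-row GEMV kernel operating on a tile of rows.
__global__ void gemv_rows(const float *A_tile, const float *x, float *y,
                          int rows, int n, int row_offset) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;   // row within this tile
    if (r >= rows) return;
    float sum = 0.0f;
    for (int j = 0; j < n; ++j)
        sum += A_tile[(size_t)r * n + j] * x[j];
    y[row_offset + r] = sum;                         // write to the global row index
}

// Double-buffered host driver: while stream b copies the next tile into its
// buffer, the other stream is still computing on the previously copied tile.
void gemv_double_buffered(const float *hA /* ideally pinned host memory */,
                          const float *dx, float *dy,
                          int m, int n, int tile_rows) {
    cudaStream_t stream[2];
    float *dA[2];
    size_t tile_bytes = (size_t)tile_rows * n * sizeof(float);
    for (int b = 0; b < 2; ++b) {
        cudaStreamCreate(&stream[b]);
        cudaMalloc((void **)&dA[b], tile_bytes);
    }
    for (int row = 0, t = 0; row < m; row += tile_rows, ++t) {
        int b = t & 1;                               // alternate buffers/streams
        int rows = (row + tile_rows <= m) ? tile_rows : (m - row);
        // In-order execution within a stream makes reuse of buffer b safe:
        // its previous copy and kernel have already been issued on stream b.
        cudaMemcpyAsync(dA[b], hA + (size_t)row * n,
                        (size_t)rows * n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        gemv_rows<<<(rows + 127) / 128, 128, 0, stream[b]>>>(
            dA[b], dx, dy, rows, n, row);
    }
    for (int b = 0; b < 2; ++b) {
        cudaStreamSynchronize(stream[b]);
        cudaStreamDestroy(stream[b]);
        cudaFree(dA[b]);
    }
}
```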
“…when p differs, the computational performance differs considerably. The same test was performed on the gemv function, and the results show that the performance of gemv also varies considerably across dimensions [18]. When the Sunway TaihuLight supercomputer was used to perform convolution via the gemm matrix-multiplication method, the computation was found to be inefficient because of the large size difference between the convolution array and the convolution kernel.…”
Section: A Convolution Calculation Optimization (mentioning)
confidence: 99%
“…Processing big data by using GPUs has drawn much attention in recent years. Following the introduction of the compute unified device architecture (CUDA), a programming model that supports the joint CPU/GPU execution of applications, by NVIDIA in 2007 [9], GPUs have become strong competitors as general-purpose parallel programming systems and have been increasingly used as tools for high-performance computation in many fields [10–17].…”
Section: Introduction (mentioning)
confidence: 99%