Given the great importance of dense matrix-vector multiplication (Ax or A^T x) in scientific computing, this paper investigates how to accelerate it on the graphics processing unit (GPU). We present a warp-based GPU implementation of Ax, called GEMV-Adaptive, and a thread-based GPU implementation of A^T x, called GEMV-T-Adaptive. The proposed GEMV-Adaptive and GEMV-T-Adaptive offer the following novelties: (1) an adaptive warp allocation strategy for GEMV-Adaptive that assigns the optimal number of warps to each matrix row, (2) an adaptive thread allocation strategy for GEMV-T-Adaptive that assigns the optimal number of threads to each matrix row, and (3) several optimization schemes. Experimental results show that the proposed GEMV-Adaptive and GEMV-T-Adaptive mitigate the performance fluctuations of the implementations in the CUBLAS library, sustain high performance, and outperform the GEMV and GEMV-T kernels recently proposed by Gao et al, respectively, for all test matrices.

KEYWORDS: CUDA, dense matrix-vector multiplication, GPU
INTRODUCTION

The dense matrix-vector multiplication routine performs one of

y = Ax or y = A^T x,

where A ∈ R^{m×n} is a dense matrix, and x and y are vectors. It has proven to be of particular importance in computational science and has been successfully applied in various fields.1-7 As matrix sizes increase in practical problems, parallel computing is required to efficiently improve the performance of dense matrix-vector multiplication.

Processing big data with graphics processing units (GPUs) has drawn much attention in recent years. Following NVIDIA's introduction of the compute unified device architecture (CUDA),8 a programming model that supports joint CPU/GPU execution of applications, in 2007, GPUs have become strong competitors as general-purpose parallel programming systems.

Researchers have recently developed suitable and flexible dense matrix-vector multiplication algorithms for the GPU architecture.9-14 A representative example is KBLAS by Abdelfattah et al.14 KBLAS is an optimized library of dense matrix-vector multiplication kernels for GPUs; it runs efficiently on various GPU architectures while avoiding code rewriting and retaining compliance with the standard BLAS API. Experimental results show that KBLAS either matches or outperforms existing state-of-the-art open-source and commercial implementations (eg, NVIDIA's standard BLAS implementation CUBLAS,15 MAGMABLAS,16 and CULA17) on different matrix sizes. A subset of the high-performance KBLAS kernels has been integrated into CUBLAS, starting with version 6.0, for wider dissemination.18
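To make the warp-based idea concrete, the sketch below shows the generic one-warp-per-row GEMV pattern on which such kernels typically build. It is a minimal illustration only, assuming a row-major single-precision matrix and CUDA 9+ shuffle intrinsics; the kernel name and launch parameters are ours, and it is not the authors' GEMV-Adaptive kernel, which adaptively varies the number of warps assigned to each row rather than fixing one warp per row.

```cuda
// Illustrative sketch: one warp computes one row of y = A x.
// Assumptions (not from the paper): row-major float matrix A,
// blockDim.x is a multiple of 32, CUDA 9+ for __shfl_down_sync.
#include <cuda_runtime.h>

#define WARP_SIZE 32

__global__ void gemv_warp_per_row(int m, int n,
                                  const float *A,   // m x n, row-major
                                  const float *x,   // length n
                                  float *y)         // length m
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    int lane   = threadIdx.x % WARP_SIZE;
    if (warpId >= m) return;   // whole warps retire together here

    // Each lane accumulates a strided partial dot product of row warpId with x.
    float sum = 0.0f;
    for (int j = lane; j < n; j += WARP_SIZE)
        sum += A[warpId * n + j] * x[j];

    // Warp-level reduction with shuffle intrinsics (no shared memory needed).
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        y[warpId] = sum;       // lane 0 writes the row result
}
```

A launch such as `gemv_warp_per_row<<<(m * 32 + 255) / 256, 256>>>(m, n, dA, dx, dy)` gives each of the m rows exactly one warp. The limitation of this fixed assignment, which the adaptive strategies in this paper address, is that one warp per row underutilizes the GPU for wide rows and wastes lanes for narrow ones.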