The block Wiedemann (BW) algorithm is frequently used to solve sparse linear systems over GF(2). Iterative sparse matrix-vector multiplication (SpMV) is its most time-consuming operation. The need to accelerate this step is motivated by the application of BW to the very large matrices that arise in the linear algebra step of the number field sieve (NFS) for integer factorization. In this paper, we derive an efficient CUDA implementation of this operation using a newly designed hybrid sparse matrix format. For a number of tested NFS matrices, this yields speedups between 4 and 8 on a single graphics processing unit (GPU) compared with an optimized multicore implementation. We further present a GPU cluster implementation of the full BW algorithm for NFS matrices. A small GPU cluster is able to outperform CPU clusters of larger size on large matrices such as the one obtained from the Kilobit special NFS factorization.

GPUs offer far higher memory bandwidth than CPUs: for example, a GeForce GTX580 (NVIDIA Corp.) provides 192.4 GB/s of memory bandwidth, whereas an Intel Core i7 (Intel Corp., Santa Clara, CA, USA) provides at most 25.6 GB/s. Sparse matrix-vector multiplication on the GPU has been explored in several previous papers [11][12][13][14] for matrices derived from scientific computing applications. However, sparse matrices derived from NFS generally have different properties: they are larger, contain a few dense rows, and contain many extremely sparse rows. The large size of the matrix causes the BL and BW algorithms to require a large number of SpMV iterations, so the time spent on matrix preprocessing and on transferring the matrix data to GPU memory is negligible compared with the total runtime. Thus, effective SpMV approaches on GPUs for NFS matrices may differ from previously published GPU SpMV approaches.

One limitation of using a GPU is that the memory size of a single device is fixed; unlike a CPU workstation, a GPU cannot be fitted with additional RAM. For example, a Fermi Tesla C2070 has only 6 GB of RAM. Sending the matrix back and forth between CPU and GPU memory is very costly, which forces us to use multiple GPUs to solve large matrices. The challenge of multi-GPU SpMV is to coordinate the communication between GPUs, and between GPUs and CPUs, efficiently while maintaining good performance. We address this challenge by presenting a multi-GPU SpMV implementation that achieves considerable speedup over traditional CPU cluster and grid implementations. This paper extends our conference paper [15] by presenting a GPU cluster parallelization and more in-depth experimental results.

This paper is organized as follows. Section 2 introduces background on the BW algorithm. Section 3 describes several published sparse matrix formats and their GPU performance when applied to NFS matrices. Section 4 presents our new formats specifically designed for NFS matrices and their CUDA implementation for the Fermi GPU architecture. The performance of our format on single GPU and dual GPU ar...
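To make the core operation concrete, the following is a minimal CUDA sketch of one SpMV iteration over GF(2), assuming a plain compressed sparse row (CSR) layout and a block vector of 64 columns packed into 64-bit words, as is common in block Wiedemann. The kernel and array names (spmv_gf2_csr, row_ptr, col_idx) are illustrative only; this is not the hybrid format proposed in Section 4.

    #include <cstdint>

    // One thread per row: XOR together the 64-bit packed vector words selected
    // by the row's column indices. Over GF(2), addition is XOR and every stored
    // nonzero equals 1, so no multiplications are needed.
    __global__ void spmv_gf2_csr(const uint32_t *row_ptr,  // row offsets, length n_rows + 1
                                 const uint32_t *col_idx,  // column index of each nonzero
                                 const uint64_t *x,        // input block vector (64 packed columns)
                                 uint64_t       *y,        // output block vector
                                 uint32_t        n_rows)
    {
        uint32_t row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n_rows) return;

        uint64_t acc = 0;
        for (uint32_t k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            acc ^= x[col_idx[k]];   // accumulate over GF(2)
        y[row] = acc;
    }

    // Host-side launch sketch; device arrays are assumed to be allocated and filled.
    void launch_spmv_gf2(const uint32_t *d_row_ptr, const uint32_t *d_col_idx,
                         const uint64_t *d_x, uint64_t *d_y, uint32_t n_rows)
    {
        const uint32_t threads = 256;
        const uint32_t blocks  = (n_rows + threads - 1) / threads;
        spmv_gf2_csr<<<blocks, threads>>>(d_row_ptr, d_col_idx, d_x, d_y, n_rows);
    }

Because such a one-thread-per-row kernel performs roughly one memory access per nonzero and assigns work in proportion to row length, the mix of a few dense rows and many extremely sparse rows found in NFS matrices leads to poor coalescing and load imbalance, which motivates a matrix format designed specifically for these matrices.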