Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs
2012
DOI: 10.1016/j.parco.2012.07.002

Cited by 37 publications (24 citation statements)
References 19 publications
“…Comparing their implementation against the one by Buatois et al., they achieved 3.7× higher average performance over the full set of matrices, although individual performance was worse in 33% of the test cases. Neglecting the three test cases of extremely poor performance mentioned above, the advantage of [17] reached a notable 6.1× over the Buatois et al. implementation.…”
Section: Related Work
confidence: 82%
“…Verschoor and Jalba [17] also aimed at improving the SpMV using the BCSR format, in their case by analyzing the effect of certain reorderings of the blocks. These authors evaluated the total speed-up, considering the average execution time over all the test cases, and reported that their implementation was on average 1.25× faster than Bell and Garland's hybrid format-based implementation.…”
Section: Related Work
confidence: 99%
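The BCSR (Block Compressed Sparse Row) layout mentioned above stores small dense b×b blocks contiguously, so each block contributes a small dense matrix-vector product. As a minimal sketch of an SpMV over that format (the function and argument names here are illustrative, not taken from either cited implementation):

```python
import numpy as np

def bcsr_spmv(block_vals, block_cols, row_ptr, x, b):
    """y = A @ x for a matrix stored in BCSR format with b x b blocks.
    block_vals: (nblocks, b, b) array of dense blocks
    block_cols: block-column index of each stored block
    row_ptr:    offset of each block row into block_vals
    """
    nbrows = len(row_ptr) - 1
    y = np.zeros(nbrows * b)
    for i in range(nbrows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = block_cols[k]
            # accumulate the contribution of block (i, j)
            y[i * b:(i + 1) * b] += block_vals[k] @ x[j * b:(j + 1) * b]
    return y
```

Block reorderings of the kind studied in [17] change the order of the inner loop's memory accesses without changing this result, which is why they can affect GPU throughput.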
“…[57,56]: In [56], a sparse CGM is implemented on the GPU. An analytical model is presented which is used for optimising two implementation parameters: the number of threads and the size of the CUDA blocks.…”
Section: Discussion
confidence: 99%
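For context, the conjugate gradient method (CGM) tuned in [56] follows the standard iteration below. This is a textbook NumPy sketch, not the cited GPU code; on the GPU, each `A @ p` product becomes the sparse kernel whose thread count and CUDA block size the analytical model selects:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    """Plain CG for a symmetric positive definite A.
    Per iteration: one matrix-vector product, two dot products,
    and three vector updates -- the SpMV dominates the cost."""
    x = np.zeros_like(b)
    r = b - A @ x          # initial residual
    p = r.copy()           # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```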
“…Parameters of the model are the warp size and the number of streaming processors of the GPU, which are machine-specific, as well as the length of each matrix row, which is data-specific. In [57] a model for executing a parallel CGM on multiple GPUs is set up. The model considers the dimension of the problem and the total number of stored elements in the matrix.…”
Section: Discussion
confidence: 99%
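As a rough illustration of the kind of multi-GPU model described for [57] (not the paper's actual formulas), a per-iteration time estimate could combine the problem dimension, the number of stored matrix elements, and the GPU count. Every constant below is a hypothetical placeholder, not a measured value:

```python
def estimate_cg_iteration_time(n, nnz, n_gpus, bw=200e9, flop_rate=1e12):
    """Toy roofline-style estimate of one CG iteration.
    n:       problem dimension (vector length)
    nnz:     number of stored matrix elements
    n_gpus:  GPUs sharing the work evenly (communication ignored)
    Assumes 8-byte values, 4-byte indices, and roughly six
    length-n vector streams per iteration -- all illustrative."""
    bytes_moved = 12 * nnz + 8 * 6 * n   # SpMV traffic + vector ops
    flops = 2 * nnz + 10 * n             # multiply-adds, dots, updates
    # whichever of bandwidth or compute binds, split across GPUs
    return max(bytes_moved / bw, flops / flop_rate) / n_gpus
```

Even this crude sketch reproduces the qualitative behavior the model captures: time grows with the number of stored elements and shrinks as GPUs are added, until communication (omitted here) starts to dominate.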