2016
DOI: 10.1016/j.procs.2016.05.303

Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs

Cited by 13 publications (5 citation statements)
References 7 publications
“…Since threads within a CUDA warp execute the same instructions in a lock-step fashion, thread divergence within a warp is expensive, because it forces diverging threads to idle status while other lock-step synchronized threads compute, thus wasting valuable core cycles. The authors in [Abdelfattah et al. 2016d] design two variants of batched Cholesky factorization to cope with the triangular nature of the matrix: loop-inclusive and loop-exclusive. In the first, all factorization iterations are executed in one kernel to maximize chances of data reuse.…”
Section: Limitations of Triangular DLA Operations (mentioning)
confidence: 99%
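
To make the loop-inclusive design concrete, here is a minimal CUDA sketch, not the authors' MAGMA code: one thread block factorizes one small matrix of the batch, keeps it in shared memory, and runs every factorization iteration inside the single kernel. The kernel name, the fixed size N, and the one-thread-per-row mapping are assumptions made for this illustration.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Loop-inclusive sketch: all N factorization iterations run inside this one
// kernel, so the matrix stays resident in shared memory for maximal reuse.
// One thread block handles one matrix of the batch (fixed size N, blockDim.x == N).
// Column-major storage, lower-triangular factor, no pivoting or error checks.
template <int N>
__global__ void batched_potrf_loopinc(double* const* dA_array, int lda)
{
    __shared__ double sA[N][N];          // whole small matrix cached on chip
    double* A = dA_array[blockIdx.x];    // this block's matrix
    const int tx = threadIdx.x;          // one thread per row

    // Load the lower triangle into shared memory.
    for (int j = 0; j <= tx; ++j)
        sA[tx][j] = A[tx + j * lda];
    __syncthreads();

    for (int k = 0; k < N; ++k) {
        if (tx == k)
            sA[k][k] = sqrt(sA[k][k]);   // diagonal step
        __syncthreads();

        if (tx > k)
            sA[tx][k] /= sA[k][k];       // scale the column below the diagonal
        __syncthreads();

        if (tx > k)                      // rank-1 update of the trailing matrix
            for (int j = k + 1; j <= tx; ++j)
                sA[tx][j] -= sA[tx][k] * sA[j][k];
        __syncthreads();
    }

    // Write the factor back to global memory.
    for (int j = 0; j <= tx; ++j)
        A[tx + j * lda] = sA[tx][j];
}

// Hypothetical launch for a batch of 1000 matrices of size 32:
//   batched_potrf_loopinc<32><<<1000, 32>>>(dA_array, 32);
```

A loop-exclusive variant would, by contrast, keep the factorization loop outside the kernel, giving up the shared-memory reuse shown here in exchange for simpler, more uniform kernels.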
“…There has been a recent push for implementing fast GPU implementations of dense Cholesky decompositions with the developments in the area of the so-called batched linear algebra, where the goal is to solve a large number of smaller problems in parallel (Abdelfattah, Haidar, Tomov, and Dongarra 2016; Dongarra, Duff, Gates, Haidar, Hammarling, Higham, Hogg, Lara, Relton, Tomov, and Zounon 2016). The developments in this area have also brought advances for the native mode Cholesky decomposition of a single large matrix.…”
Section: Primitive Function (mentioning)
confidence: 99%
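
To make the "large number of smaller problems in parallel" idea concrete, here is a hedged host-side sketch that uses cuSOLVER's fixed-size batched Cholesky routine cusolverDnDpotrfBatched. The matrix size, batch count, and memory layout are hypothetical, and this shows vendor-library usage for illustration only, not the batched or native-mode code discussed in the citing papers.

```cuda
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <vector>

// Illustrative host-side driver: factorize `batch` small SPD matrices at once.
// Sizes are hypothetical; error checking is omitted for brevity.
int main()
{
    const int n = 32, lda = n, batch = 1000;

    // One contiguous allocation holding all matrices of the batch.
    double* dA = nullptr;
    cudaMalloc(&dA, sizeof(double) * (size_t)lda * n * batch);
    // ... fill dA with `batch` symmetric positive definite matrices (omitted) ...

    // The batched API takes a device array of pointers, one per matrix.
    std::vector<double*> hA_array(batch);
    for (int i = 0; i < batch; ++i)
        hA_array[i] = dA + (size_t)i * lda * n;
    double** dA_array = nullptr;
    cudaMalloc(&dA_array, sizeof(double*) * batch);
    cudaMemcpy(dA_array, hA_array.data(), sizeof(double*) * batch,
               cudaMemcpyHostToDevice);

    int* dInfo = nullptr;                 // per-matrix status codes
    cudaMalloc(&dInfo, sizeof(int) * batch);

    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    // All `batch` Cholesky factorizations are performed by a single call.
    cusolverDnDpotrfBatched(handle, CUBLAS_FILL_MODE_LOWER, n,
                            dA_array, lda, dInfo, batch);
    cudaDeviceSynchronize();

    cusolverDnDestroy(handle);
    cudaFree(dInfo); cudaFree(dA_array); cudaFree(dA);
    return 0;
}
```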
“…However, recent need in many applications for many independent linear algebra problems of small sizes motivated the development of the so-called batched linear algebra algorithms [9,14]. Batched LU, QR, and Cholesky were developed for both fixed matrix sizes [7,8,15] and variable sizes [1,2] that are GPU-only. The reason for developing them for GPUs only is that the sizes were so small that there was not enough computation for the GPU work to overlap the expensive CPU-to-GPU communications.…”
Section: Related Work (mentioning)
confidence: 99%
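
The variable-size case mentioned above can be pictured with the following illustrative CUDA sketch, which is not MAGMA's vbatched kernels: each thread block reads its own matrix dimension from a per-matrix size array, threads beyond that dimension simply idle, and the launch pads the block size to the largest matrix in the batch. All names and the data layout are assumptions made for this example.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Variable-size sketch: block b factorizes a matrix of dimension n_array[b]
// in place in global memory; blockDim.x must be >= the largest size in the batch.
// Column-major storage, lower-triangular factor, no error checks.
__global__ void batched_potrf_vsize(double* const* dA_array,
                                    const int* n_array, const int* lda_array)
{
    const int n   = n_array[blockIdx.x];   // this block's matrix size
    const int lda = lda_array[blockIdx.x];
    double* A     = dA_array[blockIdx.x];
    const int tx  = threadIdx.x;           // threads with tx >= n stay idle

    for (int k = 0; k < n; ++k) {          // n is uniform within the block
        if (tx == k)
            A[k + k * lda] = sqrt(A[k + k * lda]);
        __syncthreads();

        if (tx > k && tx < n)
            A[tx + k * lda] /= A[k + k * lda];
        __syncthreads();

        if (tx > k && tx < n)              // rank-1 update, one thread per row
            for (int j = k + 1; j <= tx; ++j)
                A[tx + j * lda] -= A[tx + k * lda] * A[j + k * lda];
        __syncthreads();
    }
}

// Hypothetical launch, padded to the largest matrix size max_n in the batch:
//   batched_potrf_vsize<<<batchCount, max_n>>>(dA_array, n_array, lda_array);
```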
“…The reason for developing them for GPUs only is that the sizes were so small that there was not enough computation for the GPU work to overlap the expensive CPU-to-GPU communications. Regardless of the motivation, since they were developed, it was possible to easily extend them to compute single large factorizations for GPU-only execution [2,18]. Rather than these early implementations that resulted from highly-optimized batched factorizations for small problems, in this paper we concentrate on and study in detail specifically GPU-only algorithms.…”
Section: Related Work (mentioning)
confidence: 99%