Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels can simultaneously launch up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that, for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. By deploying two-sided recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.

A. Charara et al.

However, some of the most critical applications currently of high interest in the HPC community, especially in data analytics, face a major performance bottleneck due to the inadequacy of legacy BLAS/LAPACK frameworks. For instance, tensor contractions for deep learning and hierarchical low-rank data-sparse matrix computations [Hackbusch 1999; Hackbusch and Khoromskij 2000] are key operations for solving partial differential equations. Interestingly, the bulk of the computation of these operations typically resides in performing thousands of independent dense linear algebra operations on very small sizes (usually less than 100). Even the highly vendor-optimized sequential implementations may not cope with the overhead of the memory latency at these tiny sizes.
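The batched execution model discussed above — thousands of independent small problems dispatched through a single call — can be illustrated with a minimal CPU sketch. The function name and pointer-array calling convention below are hypothetical (modeled on the cuBLAS-style batched signatures), not the paper's GPU kernels; on a GPU, the outer loop over the batch would map to thread blocks launched by one kernel, amortizing the per-call overhead across the whole batch.

```c
#include <stddef.h>

/* Hypothetical batched GEMM over many tiny matrices: one call computes
 * the independent products C[b] = A[b] * B[b] for b = 0..batch_count-1,
 * with square n-by-n row-major operands. On a GPU, each b would be
 * handled by its own thread block inside a single kernel launch. */
static void gemm_batched(int n,
                         const double *const A[],
                         const double *const B[],
                         double *const C[],
                         int batch_count)
{
    for (int b = 0; b < batch_count; ++b) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double s = 0.0;
                for (int k = 0; k < n; ++k)
                    s += A[b][i * n + k] * B[b][k * n + j];
                C[b][i * n + j] = s;
            }
    }
}
```

The key design point is the interface, not the loop nest: argument checking and launch setup happen once for the whole batch, which is what makes the model viable at the tiny sizes (n < 100) where per-call overhead would otherwise dominate.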
Moreover, calling the sequential version of the dense linear algebra functions within an embarrassingly parallel OpenMP loop may not be an option, due to the API overhead (i.e., parameter sanity checks, memory initialization, etc.), which is not compensated in return because of the low arithmetic intensity of the kernel operations. This is further exacerbated by hardware with a large number of threads, such as GPUs with many streaming multiprocessors, for which high occupancy may not be reached, bandwidth may not get saturated, and thread parallelism may not be exploited given the small workloads. At present, vendors provide only a subset of the overall batched linear algebra operations, with limited support for very small problem sizes.

This paper describes high-performance implementations on GPUs of various batched triangular dense linear algebra operations targeting very small sizes (up to 256 in dimension), which are currently either poorly supported or not supported at all. There are two main algorithmic adaptations which may address this challenge: designing synchronization-reducing (i.e., strong scaling) and communication-reducing (i.e., data motion avoiding) algorithms. Alth...
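To make the recursive formulation mentioned in the abstract concrete, here is a minimal sketch of a recursively blocked lower-triangular solve (TRSM-style, L X = B). This is an illustrative CPU version under assumed row-major storage, not the paper's two-sided GPU implementation; the point it shows is that recursion converts most of the triangular work into rectangular GEMM updates, the pattern that lets batched triangular kernels inherit the performance of batched GEMM.

```c
#include <stddef.h>

/* Recursive solve of L * X = B in place (X overwrites B), where L is an
 * n-by-n lower-triangular matrix and B is n-by-m, both row-major with
 * leading dimensions ldl and ldb. Splitting L as [[L11, 0], [L21, L22]]
 * yields two half-size triangular solves plus one GEMM update, so the
 * bulk of the flops lands in the GEMM. Hypothetical sketch only. */
static void trsm_lower(int n, int m, const double *L, int ldl,
                       double *B, int ldb)
{
    if (n == 1) {                       /* base case: scalar solve */
        for (int j = 0; j < m; ++j)
            B[j] /= L[0];
        return;
    }
    int n1 = n / 2, n2 = n - n1;
    /* X1 = L11^{-1} * B1 */
    trsm_lower(n1, m, L, ldl, B, ldb);
    /* B2 -= L21 * X1  (the GEMM update) */
    for (int i = 0; i < n2; ++i)
        for (int j = 0; j < m; ++j) {
            double s = 0.0;
            for (int k = 0; k < n1; ++k)
                s += L[(n1 + i) * ldl + k] * B[k * ldb + j];
            B[(n1 + i) * ldb + j] -= s;
        }
    /* X2 = L22^{-1} * B2 */
    trsm_lower(n2, m, L + n1 * ldl + n1, ldl, B + n1 * ldb, ldb);
}
```

At the very small sizes targeted here (up to 256), the base case would in practice be a register-resident block rather than a scalar, and the recursion depth is shallow, which is what keeps synchronization and data motion low.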