2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security
DOI: 10.1109/hpcc.2014.30
LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU

Cited by 42 publications (23 citation statements) | References 2 publications
“…The principle is to have a for loop iterating over the matrices, and within this loop, compute the factorization of the matrix. This is also the approach used in [5], [6].…”
Section: Batch (mentioning confidence: 99%)
“…The use of batched algorithms [8] to launch multiple network integration kernels for each GPU streaming multiprocessor.…”
Section: Leveraging Modern Hardware: GPU Acceleration (mentioning confidence: 99%)
“…Some vendors have started to provide some batched functionalities in their numerical libraries (e.g., NVIDIA's CUBLAS and Intel's Math Kernel Library [MKL]). Additionally, some open-source libraries from the HPC community (e.g., the Matrix Algebra on GPU and Multicore Architectures [MAGMA] library [37]) have also started to deliver batched routines [11], [12], [18]. While performance has been improving with these contributions, there is still a lack of understanding of how to design, implement, analyze, and optimize batched routines to exploit modern architectures at full efficiency.…”
Section: Introduction (mentioning confidence: 99%)