2014
DOI: 10.1007/978-3-319-09873-9_41

A Distributed CPU-GPU Sparse Direct Solver

Abstract: This paper presents the first hybrid MPI+OpenMP+CUDA implementation of a distributed memory right-looking unsymmetric sparse direct solver (i.e., sparse LU factorization) that uses static pivoting. While BLAS calls can account for more than 40% of the overall factorization time, the difficulty is that small problem sizes dominate the workload, making efficient GPU utilization challenging. This fact motivates our approach, which is to find ways to aggregate collections of small BLAS operations into larger ones;…
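The aggregation idea from the abstract can be sketched outside CUDA. The following NumPy analogue (function and variable names are illustrative, not from the paper) shows the core trick: instead of issuing many tiny GEMM calls, same-shaped operands are stacked and multiplied in one batched operation, which is what batched GEMM interfaces such as cuBLAS's provide on a GPU.

```python
import numpy as np

def batched_gemm(pairs):
    """Aggregate many small, same-shaped GEMMs into one batched call.

    `pairs` is a list of (A, B) matrix pairs with identical shapes.
    Rather than looping over tiny BLAS calls, stack the operands and
    let a single 3-D matmul perform all products at once, mirroring
    how small Schur-complement updates can be batched on a GPU.
    """
    A = np.stack([a for a, _ in pairs])  # shape: (batch, m, k)
    B = np.stack([b for _, b in pairs])  # shape: (batch, k, n)
    return A @ B                         # one batched multiply -> (batch, m, n)

# Many small 4x4 updates, as might arise near the leaves of an
# elimination tree in sparse LU (sizes here are purely illustrative).
rng = np.random.default_rng(0)
pairs = [(rng.standard_normal((4, 4)), rng.standard_normal((4, 4)))
         for _ in range(64)]
C = batched_gemm(pairs)
```

This sketch assumes all small products share one shape; a real solver must first bucket operations by size before batching them.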

Cited by 36 publications (18 citation statements)
References 8 publications
“…Features and limitations:
- GPU acceleration of large circuit simulation [4]: transistor-model acceleration using if-else in-lining, coalesced memory access, and an optimal kernel-size strategy. It was specific to the transistor model only, so several researchers extended the idea to various circuit-acceleration proposals.
- Event-driven and gate-structure partitioning-based circuit acceleration [5,6]: only gates that receive an event trigger are considered, so the whole circuit is not processed, saving execution time; partitioned circuits can be processed in parallel. Design-level partitioning needs extra effort and memory.
- Blocking LU decomposition using FPGA [7], pivoting operations on FPGA [8], scalable block LU decomposition [9]: LU factorization of large sparse matrices on FPGA, with pivoting reduction applied on FPGA. Compared with an FPGA system, the GPU technique is adaptable and cost-effective; the approach is challenging to embed in some simulators.
- Fast memory levels for LU decomposition on GPU [10]: GPU memory levels are utilized efficiently to manage memory access and accelerate LU decomposition; this technique can be combined with many others to improve speed.
- Solvers such as NICSLU and PARDISO [11,12]: NICSLU schedules tasks using a column-level dependency graph [7] taken from the symbolic structure of the matrix factors; an elimination tree represents the dependencies among factors. Techniques such as numerical update, pruning, and pivoting are used for matrix stability, and these steps can consume more time in matrix operations.
- Pivoting reduction for sparse LU solvers [12-15]: modified versions of traditional solvers can be executed in parallel; in large circuits, matrix column dependency causes slower simulation.
- KLU data structures and algorithm [16]: gives stability in matrix processing; it proves slow for small-circuit simulation but has been adopted by many simulators such as NGSPICE.
- Runge-Kutta integrators using GPU [17]: parallel integration execution; can be used extensively in simulator applications.
- Distributed approach to solving sparse matrices [18]: execution on a GPU cluster is possible, but allocating resources among systems takes extra time and is costly compared with a GPU-based system.
- Various LU decomposition algorithms on GPU [19]: effective strategies for implementing LU on GPU; the left-looking algorithm has proved best for parallel implementation on GPU.
- GPU-based LU decomposition algorithm for small matrices [20]: suitable for small matrices; large matrices must be partitioned. The LU decomposition algorithm is given for dense matrices. <...>…”
Section: Sample 1: If-else In-lining Example (mentioning, confidence: 99%)
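The survey above mentions both column-level dependency graphs (NICSLU, KLU) and the left-looking algorithm's suitability for parallel GPU implementation. A minimal dense sketch (illustrative only: no pivoting, no sparsity, function name hypothetical) shows the column-at-a-time structure that creates those dependencies; in a sparse solver, column j needs updates only from columns in its elimination-tree ancestors, so disjoint subtrees can be factored in parallel.

```python
import numpy as np

def left_looking_lu(A):
    """Left-looking (lazy) LU factorization without pivoting, dense sketch.

    Column j is computed only when reached, by applying the updates from
    all previous columns. In a sparse solver, only the columns along j's
    elimination-tree path contribute, which exposes the column-level
    parallelism that NICSLU-style dependency graphs exploit.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for k in range(j):                  # gather updates from earlier columns
            v[k + 1:] -= v[k] * L[k + 1:, k]
        U[:j + 1, j] = v[:j + 1]            # upper part becomes column j of U
        if j + 1 < n:
            L[j + 1:, j] = v[j + 1:] / v[j]  # scale to form column j of L
    return L, U

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
L, U = left_looking_lu(A)
```

The right-looking solver the paper describes instead updates the trailing submatrix eagerly after each panel; the data dependencies are the same, only the scheduling differs.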
“…Murray accelerated Runge-Kutta integrators using the GPU [15], which are useful in various integration applications. Sao, Vuduc, and Li proposed a distributed approach to solving sparse matrices [16]. It is costly when compared to GPU-based parallel execution.…”
Section: Related Work (mentioning, confidence: 99%)
“…The computation time of solving the linear equations is relatively difficult to reduce. Many research efforts have been made to cut the computation time and memory requirements of linear-equation solvers, for either direct solutions (Sao et al., ; Amestoy et al., ; Davis et al., ) or iterative solutions (Balay et al., ). Various iterative methods have been developed and modified to accelerate convergence.…”
Section: The Proposed Hybrid Solution (mentioning, confidence: 99%)
“…Group E mainly includes two abstract classes, named linear SOE (system of equations) and linear solver. Many existing well-known sparse linear solvers derive from the linear-solver class, such as UMFPACK (Davis et al., ), SuperLU (Sao et al., ), MUMPS (Amestoy et al., ), PETSc (Balay et al., ), etc. Linear SOE handles operations between these solvers and OpenSees.…”
Section: Implementation With OpenSees (mentioning, confidence: 99%)