2014
DOI: 10.1007/978-3-319-09873-9_41

A Distributed CPU-GPU Sparse Direct Solver

Abstract: This paper presents the first hybrid MPI+OpenMP+CUDA implementation of a distributed memory right-looking unsymmetric sparse direct solver (i.e., sparse LU factorization) that uses static pivoting. While BLAS calls can account for more than 40% of the overall factorization time, the difficulty is that small problem sizes dominate the workload, making efficient GPU utilization challenging. This fact motivates our approach, which is to find ways to aggregate collections of small BLAS operations into larger ones;…
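The aggregation idea from the abstract can be sketched outside CUDA. The following NumPy analogue (function and variable names are illustrative, not from the paper) shows the core trick: instead of issuing many tiny GEMM calls, same-shaped operands are stacked and multiplied in one batched operation, which is what batched GEMM interfaces such as cuBLAS's provide on a GPU.

```python
import numpy as np

def batched_gemm(pairs):
    """Aggregate many small, same-shaped GEMMs into one batched call.

    `pairs` is a list of (A, B) matrix pairs with identical shapes.
    Rather than looping over tiny BLAS calls, stack the operands and
    let a single 3-D matmul perform all products at once, mirroring
    how small Schur-complement updates can be batched on a GPU.
    """
    A = np.stack([a for a, _ in pairs])  # shape: (batch, m, k)
    B = np.stack([b for _, b in pairs])  # shape: (batch, k, n)
    return A @ B                         # one batched multiply -> (batch, m, n)

# Many small 4x4 updates, as might arise near the leaves of an
# elimination tree in sparse LU (sizes here are purely illustrative).
rng = np.random.default_rng(0)
pairs = [(rng.standard_normal((4, 4)), rng.standard_normal((4, 4)))
         for _ in range(64)]
C = batched_gemm(pairs)
```

This sketch assumes all small products share one shape; a real solver must first bucket operations by size before batching them.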

Cited by 36 publications (18 citation statements)
References 8 publications
“…Features and limitations:
- GPU acceleration of large circuit simulation [4]: transistor-model acceleration using if-else in-lining, coalesced memory access, and an optimal kernel-size strategy. It was specific to the transistor model only, so several researchers extended the idea to various circuit-acceleration proposals.
- Event-driven and gate-structure partitioning-based circuit acceleration [5,6]: only gates that receive an event trigger are considered, so the whole circuit is not processed, saving execution time; partitioned circuits can be processed in parallel. Design-level partitioning needs extra effort and memory.
- Blocking LU decomposition using FPGA [7], pivoting operations on FPGA [8], scalable block LU decomposition [9]: LU factorization of large sparse matrices on FPGA, with pivoting reduction applied on FPGA. Compared with an FPGA system, the GPU technique is adaptable and cost-effective; the approach is challenging to embed in some simulators.
- Fast memory levels for LU decomposition on GPU [10]: GPU memory levels are utilized efficiently to manage memory access and accelerate LU decomposition; this technique can be combined with many others to improve speed.
- Solvers such as NICSLU and PARDISO [11,12]: NICSLU schedules tasks using a column-level dependency graph [7] taken from the symbolic structure of the matrix factors; an elimination tree represents the dependencies among factors. Techniques such as numerical update, pruning, and pivoting are used for matrix stability, and these steps can consume more time in matrix operations.
- Pivoting reduction for sparse LU solvers [12-15]: modified versions of traditional solvers can be executed in parallel; in large circuits, matrix column dependency causes slower simulation.
- KLU data structures and algorithm [16]: gives stability in matrix processing; it proves slow for small-circuit simulation but has been adopted by many simulators such as NGSPICE.
- Runge-Kutta integrators using GPU [17]: parallel integration execution; can be used extensively in simulator applications.
- Distributed approach to solving sparse matrices [18]: execution on a GPU cluster is possible, but allocating resources among systems takes extra time and is costly compared with a GPU-based system.
- Various LU decomposition algorithms on GPU [19]: effective strategies for implementing LU on GPU; the left-looking algorithm has proved best for parallel implementation on GPU.
- GPU-based LU decomposition algorithm for small matrices [20]: suitable for small matrices; large matrices must be partitioned. The LU decomposition algorithm is given for dense matrices. <...>…”
Section: Sample 1: If-else In-lining Example (mentioning, confidence: 99%)
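The survey above mentions both column-level dependency graphs (NICSLU, KLU) and the left-looking algorithm's suitability for parallel GPU implementation. A minimal dense sketch (illustrative only: no pivoting, no sparsity, function name hypothetical) shows the column-at-a-time structure that creates those dependencies; in a sparse solver, column j needs updates only from columns in its elimination-tree ancestors, so disjoint subtrees can be factored in parallel.

```python
import numpy as np

def left_looking_lu(A):
    """Left-looking (lazy) LU factorization without pivoting, dense sketch.

    Column j is computed only when reached, by applying the updates from
    all previous columns. In a sparse solver, only the columns along j's
    elimination-tree path contribute, which exposes the column-level
    parallelism that NICSLU-style dependency graphs exploit.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for k in range(j):                  # gather updates from earlier columns
            v[k + 1:] -= v[k] * L[k + 1:, k]
        U[:j + 1, j] = v[:j + 1]            # upper part becomes column j of U
        if j + 1 < n:
            L[j + 1:, j] = v[j + 1:] / v[j]  # scale to form column j of L
    return L, U

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
L, U = left_looking_lu(A)
```

The right-looking solver the paper describes instead updates the trailing submatrix eagerly after each panel; the data dependencies are the same, only the scheduling differs.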
“…Murray accelerated Runge-Kutta integrators using the GPU [15], which are useful in various integration applications. Sao, Vuduc, and Li proposed a distributed approach to solving sparse matrices [16]. It is costly when compared to GPU-based parallel execution.…”
Section: Related Work (mentioning, confidence: 99%)
“…The computation time of solving the linear equations is relatively difficult to reduce. Many research efforts have been made to cut the computation time and memory requirements of linear-equation solvers, for either direct solutions (Sao et al., ; Amestoy et al., ; Davis et al., ) or iterative solutions (Balay et al., ). Various iterative methods have been developed and modified to accelerate convergence.…”
Section: The Proposed Hybrid Solution (mentioning, confidence: 99%)
“…Group E mainly includes two abstract classes, named linear SOE (system of equations) and linear solver. Many existing well-known sparse linear solvers derive from the linear-solver class, such as UMFPACK (Davis et al., ), SuperLU (Sao et al., ), MUMPS (Amestoy et al., ), PETSc (Balay et al., ), etc. Linear SOE handles operations between these solvers and OpenSees.…”
Section: Implementation With OpenSees (mentioning, confidence: 99%)