2006
DOI: 10.1016/j.jpdc.2006.07.001

Parallel sparse LU factorization on different message passing platforms

Abstract: Several message passing-based parallel solvers have been developed for general (nonsymmetric) sparse LU factorization with partial pivoting. Existing solvers were mostly deployed and evaluated on parallel computing platforms with high message passing performance (e.g., 1-10 µs in message latency and 100-1000 Mbytes/sec in message throughput) while little attention has been paid to slower platforms. This paper investigates techniques that are specifically beneficial for LU factorization on platforms with slow m…
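For context, the factorization the abstract refers to computes PA = LU for a general sparse matrix, with row pivoting for numerical stability. Below is a minimal, purely illustrative sketch of that computation using SciPy's sequential SuperLU wrapper on a small made-up system; it is not the paper's parallel, message-passing implementation, and the matrix values are placeholders.

```python
# Minimal sketch: sparse LU factorization with partial pivoting on one node.
# Uses SciPy's SuperLU wrapper purely to illustrate the kind of computation
# the parallel solvers discussed in the paper distribute across processes.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Small hypothetical nonsymmetric sparse system A x = b (placeholder values).
A = sp.csc_matrix(np.array([[4.0, 1.0, 0.0],
                            [2.0, 5.0, 1.0],
                            [0.0, 3.0, 6.0]]))
b = np.array([1.0, 2.0, 3.0])

lu = spla.splu(A)   # LU factorization with (threshold) partial pivoting
x = lu.solve(b)     # triangular solves using the computed L and U factors

print("x =", x)
print("residual norm =", np.linalg.norm(A @ x - b))
```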



Cited by 5 publications (3 citation statements)
References 31 publications
“…This verifies the accuracy of the BBCR solver. Secondly, the speed-up ratios of the BBCR and MUMPS solvers are compared in Fig. 3. On the one hand, we can see that MUMPS shows poor speed-up performance, which is consistent with related research (Shen K, 2006), and it seems that the limiting speed-up ratio of MUMPS for this problem is less than 2.…”
Section: Numerical Experiments (supporting)
confidence: 87%
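The comparison quoted above is stated in terms of the speed-up ratio, i.e. the serial factorization time divided by the time on p processes. The following is a tiny hypothetical sketch of how such a curve is tabulated from wall-clock timings; the numbers are placeholders, not measurements from the cited solvers.

```python
# Hypothetical speed-up tabulation: speed-up(p) = T(1) / T(p),
# where T(p) is the wall-clock factorization time on p processes.
# The timings below are placeholders, not data from BBCR, MUMPS, or SuperLU.
serial_time = 120.0                            # T(1), seconds
parallel_times = {2: 75.0, 4: 64.0, 8: 61.0}   # T(p), seconds

for p, t in sorted(parallel_times.items()):
    speedup = serial_time / t
    efficiency = speedup / p
    print(f"p={p}: speed-up={speedup:.2f}, parallel efficiency={efficiency:.2f}")
```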
“…It can not only simulate multi-shot seismic data efficiently, but also reduce the memory requirement and computing time. However, the speed-up performance of traditional parallel sparse LU solvers, such as MUMPS (Amestoy et al., 2006; MUMPS Team, 2015) and SuperLU, is quite poor (Shen K, 2006). Therefore, a new parallel direct solver is required to get better speed-up performance.…”
Section: Introduction (mentioning)
“…For example, Van der Stappen et al. [32] present an algorithm for parallel calculation of the LU decomposition on a mesh network of transputers where each processor holds a part of the matrix. Shen [31] evaluates techniques for LU decomposition distributed over nodes that are connected via slow message passing. Dongarra et al. [9] demonstrate an optimized implementation of matrix inversion on a single multicore node, focusing on the minimization of synchronization between the different processing cores.…”
Section: Foundations and Related Work (mentioning)
confidence: 99%
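The related work quoted above rests on distributing the matrix so that each processor holds a part of it. A hypothetical mpi4py sketch of the simplest such layout, a 1-D block-row distribution, follows; it only illustrates the data placement, not a full distributed LU factorization with pivoting.

```python
# Hypothetical sketch of a 1-D block-row distribution over MPI processes.
# Run with, e.g., `mpiexec -n 4 python block_rows.py` (requires mpi4py).
# Illustrates only the data layout, not distributed LU with pivoting.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 8  # small placeholder matrix dimension
if rank == 0:
    A = np.random.rand(n, n)
    blocks = np.array_split(A, size, axis=0)   # one row block per process
else:
    blocks = None

local_rows = comm.scatter(blocks, root=0)      # each rank receives its block
print(f"rank {rank} holds a {local_rows.shape} block of rows")
```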