Abstract. Static symbolic factorization coupled with supernode partitioning and asynchronous computation scheduling can achieve high gigaflop rates for parallel sparse LU factorization with partial pivoting. This paper studies properties of elimination forests and uses them to optimize supernode partitioning/amalgamation and execution scheduling. It also proposes supernodal matrix multiplication to speed up kernel computation by retaining the BLAS-3 level efficiency and avoiding unnecessary arithmetic operations. The experiments show that our new design with proper space optimization, called S+, improves our previous solution substantially and can achieve up to 10 GFLOPS on 128 Cray T3E 450MHz nodes.

Key words. Gaussian elimination with partial pivoting, LU factorization, sparse matrices, elimination forests, supernode amalgamation and partitioning, asynchronous computation scheduling
AMS subject classifications. 65F50, 65F05

PII. S0895479898337385

1. Introduction. The solution of sparse linear systems is a computational bottleneck in many scientific computing problems. When dynamic pivoting is required to maintain numerical stability in direct methods for solving nonsymmetric linear systems, it is challenging to develop high-performance parallel code because pivoting causes severe cache misses and load imbalance on modern architectures with memory hierarchies. Previous work has addressed parallelization on shared memory platforms or with restricted pivoting [4,13,15,19]. Most notably, the recent shared memory implementation of SuperLU has achieved up to 2.58 GFLOPS on 8 Cray C90 nodes [4,5,23]. For distributed memory machines, we proposed an approach that adopts a static symbolic factorization scheme to avoid data structure variation [10,11]. Static symbolic factorization eliminates the runtime overhead of dynamic symbolic factorization at the price of overestimated fill-ins and, therefore, extra computation [15]. However, the static data structure allowed us to identify data regularity, maximize the use of BLAS-3 operations, and utilize task graph scheduling techniques and efficient runtime support [12] to achieve high efficiency.

This paper addresses three issues to further improve the performance of parallel sparse LU factorization with partial pivoting on distributed memory machines. First, we study the use of elimination trees in optimizing matrix partitioning and task scheduling. Elimination trees or forests are used extensively in sparse Cholesky factorization [18,26,27] because they provide a more compact representation of parallelism than task graphs. For sparse LU factorization, the traditional approach uses the elimination tree of A^T A, which can produce excessive false computational dependencies. In this paper, we use the elimination forest of A to guide matrix
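As background for the discussion of elimination trees, the following is a minimal sketch (not the paper's code; the function name and input layout are assumptions) of Liu's classical algorithm for computing the elimination tree of a symmetric sparse pattern, such as the pattern of A^T A used by the traditional approach. The parent of column j is the smallest row index below the diagonal that fills in during factorization, and path compression keeps the traversal near-linear.

```python
# Sketch of Liu's elimination-tree algorithm for a symmetric sparse pattern.
# cols[j] lists the row indices i < j with a nonzero in column j
# (i.e., the pattern strictly above the diagonal).

def elimination_tree(n, cols):
    parent = [-1] * n       # parent[j] == -1 means column j is a forest root
    ancestor = [-1] * n     # path-compressed ancestor pointers
    for j in range(n):
        for i in cols[j]:   # each above-diagonal nonzero in column j
            r = i
            # Walk up the partially built tree from r toward a root,
            # compressing the path to point at j as we go.
            while ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j
                r = nxt
            if ancestor[r] == -1:   # r had no parent yet: j becomes its parent
                ancestor[r] = j
                parent[r] = j
    return parent
```

For example, an "arrow" pattern whose last column is dense yields a flat tree with the last column as root, while a tridiagonal pattern yields a chain, reflecting the contrast in available parallelism that motivates using the more accurate elimination forest of A.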