Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures 2013
DOI: 10.1145/2486159.2486198
Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout

Abstract: High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian Elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: partial pivoting …

Cited by 6 publications (3 citation statements) · References 18 publications
“…The data layout transformation is equivalent to transforming a matrix in column-major layout to a block-contiguous layout. By applying (for example) the Separate function given as Algorithm 3 in [Ballard et al 2013] to each panel of width Θ(√M) a logarithmic number of times, we can convert H from column-major to Θ(√M)-by-Θ(√M) block-contiguous layout with total bandwidth cost O(n² log(n/√M)) and total latency cost O((n²/M) log(n/√M)), which are lower-order terms for n ≫ √M. Note that these two optimizations cannot both be applied straightforwardly to the approach of [Bischof et al 1994], as H will not be written in column-major order when multiple bulges are chased at a time.…”
Section: Algorithm (mentioning)
confidence: 99%
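The layout transformation these citers describe, from column-major to block-contiguous order, can be illustrated with a minimal out-of-place sketch. Note the assumptions: the paper's Separate routine (Algorithm 3 in [Ballard et al 2013]) performs this conversion in place over a logarithmic number of passes to bound communication, whereas this one-shot copy only demonstrates the target layout; the function name and NumPy usage are illustrative, not the paper's code.

```python
import numpy as np

def to_block_contiguous(flat_colmajor, n, b):
    """Illustrative sketch (not the paper's in-place Separate routine):
    take an n-by-n matrix stored as a flat column-major array and return
    a flat array in b-by-b block-contiguous order, i.e. blocks laid out
    one after another, each block's entries stored column-major within it.
    Assumes b divides n for simplicity."""
    A = flat_colmajor.reshape((n, n), order="F")  # view as column-major matrix
    blocks = []
    for bi in range(0, n, b):          # block-row index
        for bj in range(0, n, b):      # block-column index
            # copy one b-by-b block out in column-major order
            blocks.append(A[bi:bi + b, bj:bj + b].ravel(order="F"))
    return np.concatenate(blocks)
```

In the communication-cost model, the point of doing this with a logarithmic number of in-place panel passes (rather than a naive gather like the one above) is that each pass streams the data once, giving the O(n² log(n/√M)) bandwidth and O((n²/M) log(n/√M)) latency terms quoted in the citation.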
“…The Shape-Morphing LU algorithm (SMLU) [6] is an adaptation of RLU that changes the matrix layout on the fly to reduce latency cost. The algorithm and its analysis are given in [6], and its communication costs appear in the second row of Table 1. SMLU uses partial pivoting and incurs a slight bandwidth-cost overhead relative to RLU (an extra logarithmic factor).…”
Section: Algorithm (mentioning)
confidence: 99%
“…The LUPP algorithm is widely used in scientific computing applications, including solving linear systems in the HPL benchmark used to rank supercomputers, and it continues to attract investigation and optimization, for example from the communication-avoiding perspective (Ballard et al, 2013). Traditional ABFT methods handle soft errors in matrix operations only at the end of the computation (Huang and Abraham, 1984).…”
Section: Introduction (mentioning)
confidence: 99%