Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures 2013
DOI: 10.1145/2486159.2486198
Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout

Abstract: High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian Elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: partial pivoting …

Cited by 6 publications (3 citation statements) · References 18 publications
“…The data layout transformation is equivalent to transforming a matrix in column-major layout to a block-contiguous layout. By applying (for example) the Separate function given as Algorithm 3 in [Ballard et al 2013] to each panel of width Θ(√M) a logarithmic number of times, we can convert H from column-major to Θ(√M)-by-Θ(√M) block-contiguous layout with total bandwidth cost O(n² log(n/√M)) and total latency cost O((n²/M) log(n/√M)), which are lower-order terms for n ≫ √M. Note that these two optimizations cannot both be applied straightforwardly to the approach of [Bischof et al 1994], as H will not be written in column-major order when multiple bulges are chased at a time.…”
Section: Algorithm (mentioning)
confidence: 99%
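The layout transformation these citers describe, from column-major to block-contiguous order, can be illustrated with a minimal out-of-place sketch. Note the assumptions: the paper's Separate routine (Algorithm 3 in [Ballard et al 2013]) performs this conversion in place over a logarithmic number of passes to bound communication, whereas this one-shot copy only demonstrates the target layout; the function name and NumPy usage are illustrative, not the paper's code.

```python
import numpy as np

def to_block_contiguous(flat_colmajor, n, b):
    """Illustrative sketch (not the paper's in-place Separate routine):
    take an n-by-n matrix stored as a flat column-major array and return
    a flat array in b-by-b block-contiguous order, i.e. blocks laid out
    one after another, each block's entries stored column-major within it.
    Assumes b divides n for simplicity."""
    A = flat_colmajor.reshape((n, n), order="F")  # view as column-major matrix
    blocks = []
    for bi in range(0, n, b):          # block-row index
        for bj in range(0, n, b):      # block-column index
            # copy one b-by-b block out in column-major order
            blocks.append(A[bi:bi + b, bj:bj + b].ravel(order="F"))
    return np.concatenate(blocks)
```

In the communication-cost model, the point of doing this with a logarithmic number of in-place panel passes (rather than a naive gather like the one above) is that each pass streams the data once, giving the O(n² log(n/√M)) bandwidth and O((n²/M) log(n/√M)) latency terms quoted in the citation.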
“…The Shape-Morphing LU algorithm (SMLU) [6] is an adaptation of RLU that changes the matrix layout on the fly to reduce latency cost. The algorithm and its analysis are given in [6], and its communication costs appear in the second row of Table 1. SMLU uses partial pivoting and incurs a slight bandwidth-cost overhead relative to RLU (an extra logarithmic factor).…”
Section: Algorithm (mentioning)
confidence: 99%
“…The LUPP algorithm is widely used in scientific computing applications, including solving linear systems in the HPL benchmark used to rank supercomputers, and it continues to attract investigation and optimization, for example from the communication-avoiding perspective (Ballard et al, 2013). Traditional ABFT methods handle soft errors in matrix operations only at the end of the computation (Huang and Abraham, 1984).…”
Section: Introduction (mentioning)
confidence: 99%