QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism

Frens, Jeremy D.; Wise, David S.

doi:10.1145/966049.781525

Cited by 31 publications

(27 citation statements)

References 22 publications


“…(In practice one does not recur down to 1-by-1 submatrices because of the high overhead. Also, some cache-oblivious algorithms require a constant factor more arithmetic operations than non-oblivious alternatives [FW03]. So "pure" cache-obliviousness is not a panacea.…”

Section: Algorithm (mentioning)

confidence: 99%
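The point about not recurring down to 1-by-1 submatrices can be illustrated with a small sketch. This is not code from the cited papers: `matmul_recursive` and its `cutoff` parameter are hypothetical names, and the sketch assumes square matrices stored as lists of lists with a power-of-two dimension `n`.

```python
def matmul_recursive(A, B, C, n, ar=0, ac=0, br=0, bc=0, cr=0, cc=0, cutoff=16):
    """Accumulate the product of two n-by-n blocks into C: C_blk += A_blk @ B_blk.

    (ar, ac), (br, bc), (cr, cc) are the top-left corners of the blocks.
    Assumes n is a power of two (an assumption of this sketch).
    """
    if n <= cutoff:
        # Base case: plain triple loop, which avoids the call overhead
        # of recurring all the way down to 1-by-1 blocks.
        for i in range(n):
            for k in range(n):
                a = A[ar + i][ac + k]
                for j in range(n):
                    C[cr + i][cc + j] += a * B[br + k][bc + j]
        return
    h = n // 2
    # Eight half-size products, one per (i, j, k) quadrant combination.
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                matmul_recursive(A, B, C, h,
                                 ar + i, ac + k,
                                 br + k, bc + j,
                                 cr + i, cc + j,
                                 cutoff)
```

Setting `cutoff=1` gives the "pure" recursion; a larger cutoff trades some obliviousness for lower overhead. A good value is machine dependent and is not specified by the quoted text.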

Minimizing Communication in Numerical Linear Algebra

Ballard, Demmel, Holtz et al. 2011

SIAM J. Matrix Anal. & Appl.



Abstract. In 1981 Hong and Kung proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix-multiplication using the conventional O(n^3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, Gram-Schmidt algorithm, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth-cost), we get lower bounds on the number of messages required to move it (latency-cost). We extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms that attain many of these lower bounds.
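For reference, the bound stated in the abstract can be written out as follows. The symbols W (words moved) and S (messages sent) are notation introduced here, not taken from the abstract, and the latency line relies on the additional assumption that a single message carries at most M words.

```latex
% Bandwidth cost, as stated in the abstract:
\[
  W \;=\; \Omega\!\left(\frac{\#\,\text{arithmetic operations}}{\sqrt{M}}\right),
  \qquad\text{so for conventional } n \times n \text{ matrix multiplication}\quad
  W \;=\; \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right).
\]
% Latency cost, assuming each message carries at most M words:
\[
  S \;\ge\; \frac{W}{M}
  \;=\; \Omega\!\left(\frac{\#\,\text{arithmetic operations}}{M^{3/2}}\right).
\]
```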


“…When applied to square power-of-two matrices, our choices lead to a standard N -Morton ordering. There are several alternatives for generalizing Morton ordering [7,8,9,12]. The simplest approach is to pad both rows and columns with zeros to obtain a square power-of-two matrix.…”

Section: Data Layouts (mentioning)

confidence: 99%
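As an illustration of the padding approach in the quote, the following sketch pads a matrix with zeros to a square power-of-two size and stores it in a Morton (bit-interleaved) order. The function names are hypothetical, and the bit convention (row bits in the more significant position of each pair) is an assumption of this sketch; the quote does not say which convention the citing paper uses, and swapping the `row` and `col` arguments gives the transposed ordering.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def morton_index(row: int, col: int, bits: int) -> int:
    """Interleave the low `bits` bits of row and col into one index.

    Bit i of col goes to position 2*i, bit i of row to position 2*i + 1
    (an assumed convention, see the note above).
    """
    idx = 0
    for i in range(bits):
        idx |= ((col >> i) & 1) << (2 * i)
        idx |= ((row >> i) & 1) << (2 * i + 1)
    return idx


def to_morton(matrix):
    """Zero-pad a rectangular list-of-lists matrix and flatten it in Morton order."""
    rows, cols = len(matrix), len(matrix[0])
    side = next_pow2(max(rows, cols))
    bits = side.bit_length() - 1          # log2(side)
    flat = [0] * (side * side)            # padding entries stay zero
    for r in range(rows):
        for c in range(cols):
            flat[morton_index(r, c, bits)] = matrix[r][c]
    return flat
```

With this convention each quadrant of the padded matrix occupies one contiguous quarter of `flat`, which is the property that quadrant-recursive algorithms exploit.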

“…However this can increase the number of matrix elements by a factor of 4 times the ratio of large dimension to small dimension. This approach is explored in [9], where the authors avoid the extra space and computation on padded rows and columns using "decorations" which denote full, partial, and zero submatrices. Hybrid layouts are also often used, storing small blocks in column or row-major layout and ordering the blocks using a Morton ordering.…”

Section: Data Layouts (mentioning)

confidence: 99%
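The growth factor mentioned in the quote can be checked with a short calculation. The helper names below are hypothetical; the sketch assumes the padded matrix is square with side equal to the next power of two at or above the larger dimension, so the side is less than twice the larger dimension.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def padding_growth(rows: int, cols: int) -> float:
    """Ratio of stored elements after zero-padding to elements before."""
    side = next_pow2(max(rows, cols))
    return (side * side) / (rows * cols)


def growth_bound(rows: int, cols: int) -> float:
    """Upper bound from the quote: 4 * (large dimension / small dimension)."""
    return 4 * max(rows, cols) / min(rows, cols)
```

For example, a 1025-by-1 matrix pads to 2048-by-2048, so `padding_growth(1025, 1)` is close to the bound of 4100, while a matrix that is already square with power-of-two size has growth 1.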

“…Abandoning the requirement that the orthogonal factor Q be computed with one Householder vector per column allows for a square recursive algorithm for QR [9]. The square recursive algorithm maps nicely onto standard Morton ordering, as each computation involves matrix quadrants.…”

Section: Rectangular Recursive Algorithms for LU and QR (mentioning)

confidence: 99%

“…column, the standard trailing matrix update techniques do not apply. The approach from [9] is to explicitly construct the orthogonal factor Q, using matrix multiplication to update the trailing matrix. This technique leads to an increase in the total flop count of the decomposition compared to the standard algorithm, by a factor of approximately 3×.…”

Section: Rectangular Recursive Algorithms for LU and QR (mentioning)

confidence: 99%


Communication efficient Gaussian elimination with partial pivoting using a shape morphing data layout

Ballard, Demmel, Lipshitz et al. 2013

Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures


High performance for numerical linear algebra often comes at the expense of stability. Computing the LU decomposition of a matrix via Gaussian Elimination can be organized so that the computation involves regular and efficient data access. However, maintaining numerical stability via partial pivoting involves row interchanges that lead to inefficient data access patterns. To optimize communication efficiency throughout the memory hierarchy we confront two seemingly contradictory requirements: partial pivoting is efficient with column-major layout, whereas a recursive layout is optimal for the rest of the computation. We resolve this by introducing a shape morphing procedure that dynamically matches the layout to the computation throughout the algorithm, and show that Gaussian Elimination with partial pivoting can be performed in a communication efficient and cache-oblivious way. Our technique extends to QR decomposition, where computing Householder vectors prefers a different data layout than the rest of the computation.


Surrounding Theorem: Developing Parallel Programs for Matrix-Convolutions

Emoto, Matsuzaki et al. 2006

Euro-Par 2006 Parallel Processing


Computations on two-dimensional arrays such as matrices and images are one of the most fundamental and ubiquitous things in computational science and its vast application areas, but development of efficient parallel programs on two-dimensional arrays is known to be hard. To solve this problem, we have proposed a skeletal framework on two-dimensional arrays based on the theory of constructive algorithmics. It supports users, even with little knowledge about parallel machines, to develop systematically both correct and efficient parallel programs on two-dimensional arrays. In this paper, we apply our framework to the matrix-convolutions often used in image filters and difference methods. We show the efficacy of the framework by giving a general parallel program for the matrix-convolutions described with the skeletons, and a theorem that optimizes the general program into an application-specific one.
