In this article, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. Specifically, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n x 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
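To make the tensor product view concrete, the following is a minimal sketch in plain Python, not the generated Cray code described in the abstract. It assumes the standard seven-product form of Strassen's scheme, written as three small coefficient matrices (here called `SA`, `SB`, and `SC`), and matrices stored in recursive quadrant order. All function names are hypothetical. The sketch corresponds to the straightforward formulation that needs an intermediate vector of length 7^n; it does not implement the memory-reduced O(4^n) variant.

```python
# Coefficient matrices of the standard Strassen 2x2 scheme (an assumption:
# the abstract does not say which variant the authors use).
# SA and SB map (x11, x12, x21, x22) to the 7 operands; SC maps the
# 7 products back to (c11, c12, c21, c22).
SA = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
      [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]]
SB = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
      [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
SC = [[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
      [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]]


def apply_factor(S, v, left, right):
    """Compute (I_left (x) S (x) I_right) v without forming the big matrix."""
    rows, cols = len(S), len(S[0])
    out = [0] * (left * rows * right)
    for l in range(left):
        for r in range(rows):
            dst = (l * rows + r) * right
            for c in range(cols):
                s = S[r][c]
                if s == 0:
                    continue
                src = (l * cols + c) * right
                for k in range(right):
                    out[dst + k] += s * v[src + k]
    return out


def tensor_power_apply(S, v, n):
    """Apply the n-fold tensor power of S to v as n loop stages (no recursion)."""
    rows, cols = len(S), len(S[0])
    for i in range(n):
        v = apply_factor(S, v, rows ** i, cols ** (n - 1 - i))
    return v


def to_quadrant_order(M):
    """Flatten a 2^n x 2^n matrix in recursive quadrant order."""
    size = len(M)
    if size == 1:
        return [M[0][0]]
    h = size // 2
    out = []
    for r0, c0 in ((0, 0), (0, h), (h, 0), (h, h)):
        out += to_quadrant_order([row[c0:c0 + h] for row in M[r0:r0 + h]])
    return out


def from_quadrant_order(v, size):
    """Inverse of to_quadrant_order."""
    if size == 1:
        return [[v[0]]]
    h, q = size // 2, len(v) // 4
    b = [from_quadrant_order(v[i * q:(i + 1) * q], h) for i in range(4)]
    return ([b[0][r] + b[1][r] for r in range(h)] +
            [b[2][r] + b[3][r] for r in range(h)])


def strassen_multiply(A, B):
    """Multiply two 2^n x 2^n matrices (size assumed to be a power of two)."""
    size = len(A)
    n = size.bit_length() - 1
    a = tensor_power_apply(SA, to_quadrant_order(A), n)  # length 7^n
    b = tensor_power_apply(SB, to_quadrant_order(B), n)  # length 7^n
    products = [x * y for x, y in zip(a, b)]             # 7^n scalar products
    c = tensor_power_apply(SC, products, n)              # length 4^n
    return from_quadrant_order(c, size)


print(strassen_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# prints [[19, 22], [43, 50]]
```

Each call to `apply_factor` is a loop nest with independent iterations over `left` and `right`, which is the kind of structure that a tensor product formula exposes for vectorization and parallelization.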
A judiciously chosen symmetric permutation can significantly reduce the amount of storage and computation for the Cholesky factorization of sparse matrices. On distributed memory machines, the issue of mapping data and computation onto processors is also important. Previous research on ordering for parallelism has focused on idealized measures like execution time on an unbounded number of processors, with zero communication costs. In this paper, we propose an ordering and mapping algorithm that attempts to minimize communication and performs load-balancing of work among the processors. Performance results on an Intel iPSC/860 hypercube are presented to demonstrate its effectiveness.
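The effect of a symmetric permutation on storage can be illustrated with a small symbolic elimination sketch. This is not the ordering and mapping algorithm proposed in the paper; it only counts the fill entries that Cholesky factorization would create for a given elimination order, using the standard elimination-graph model. The function name `count_fill` and the "arrow" test matrix are assumptions chosen for illustration.

```python
def count_fill(n, edges, order):
    """Count fill edges created when eliminating vertices in the given order.

    The sparse symmetric matrix is modeled as a graph: one vertex per
    row/column and one edge per off-diagonal nonzero pair.
    """
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    eliminated = set()
    fill = 0
    for v in order:
        # Neighbors that are still uneliminated become pairwise connected.
        nbrs = [u for u in adj[v] if u not in eliminated]
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                a, b = nbrs[i], nbrs[j]
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        eliminated.add(v)
    return fill


# "Arrow" matrix of order 6: vertex 0 is coupled to every other vertex.
n = 6
edges = [(0, k) for k in range(1, n)]

print(count_fill(n, edges, [0, 1, 2, 3, 4, 5]))  # dense row/column first: 10
print(count_fill(n, edges, [1, 2, 3, 4, 5, 0]))  # dense row/column last: 0
```

Eliminating the dense row and column first fills in the entire remaining matrix, while the permuted order creates no fill at all, which is why the choice of ordering matters before any mapping to processors is considered.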
This paper presents an integrated approach to two issues relevant to efficient parallel sparse Cholesky factorization: (1) matrix reordering for parallelism and (2) mapping of data to processors. A clustering heuristic is proposed that performs a fill-preserving reordering and mapping of data onto a fixed number of processors. Performance results on a Cray T3D are presented to demonstrate its effectiveness.