B. Kumar scite author profile

Huang

Johnson

et al.


A Tensor Product Formulation of Strassen′s Matrix Multiplication Algorithm with Memory Reduction

Huang

Sadayappan

et al. 1995

Scientific Programming

In this article, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. We present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
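For readers unfamiliar with the algorithm the abstract refers to, here is a minimal sketch of the textbook recursive form of Strassen's method for 2^n × 2^n matrices. It is not the nonrecursive tensor product formulation the article describes; the function names (`strassen`, `naive_matmul`, `split`) are illustrative assumptions. Each level forms seven block products instead of eight, so n levels yield 7^n scalar multiplications.

```python
def add(A, B):
    """Elementwise sum of two equally sized matrices (lists of rows)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    """Elementwise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(A):
    """Split a square matrix of even order into four quadrants."""
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])


def strassen(A, B):
    """Multiply two 2^n x 2^n matrices with seven recursive products per level."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    # Stitch the four result quadrants back into one matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom


def naive_matmul(A, B):
    """Reference triple-loop product, used only to check the sketch."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Calling `strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]])` gives the same result as `naive_matmul` on the same inputs.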

A reordering and mapping algorithm for parallel sparse Cholesky factorization

Eswar

Sadayappan

et al.

A judiciously chosen symmetric permutation can significantly reduce the amount of storage and computation for the Cholesky factorization of sparse matrices. On distributed memory machines, the issue of mapping data and computation on processors is also important. Previous research on ordering for parallelism has focused on idealized measures like execution time on an unbounded number of processors, with zero communication costs. In this paper, we propose an ordering and mapping algorithm that attempts to minimize communication and performs load-balancing of work among the processors. Performance results on an Intel iPSC/860 hypercube are presented to demonstrate its effectiveness.
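The abstract's first claim, that a symmetric permutation can reduce storage for sparse Cholesky factorization, can be illustrated with a small sketch. It counts fill-in by symbolic elimination on the matrix graph for the classic "arrow" pattern. This is not the paper's ordering and mapping algorithm; the function name `fill_in` and the example graph are assumptions chosen for illustration.

```python
def fill_in(n, edges, order):
    """Count fill edges created when eliminating vertices in the given order.

    n     -- number of vertices (rows/columns of the symmetric matrix)
    edges -- off-diagonal nonzeros as (u, v) pairs
    order -- elimination order, i.e. the symmetric permutation
    """
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    eliminated = set()
    fill = 0
    for v in order:
        # Uneliminated neighbours of v become pairwise connected.
        nbrs = [u for u in adj[v] if u not in eliminated]
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                a, b = nbrs[i], nbrs[j]
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        eliminated.add(v)
    return fill


# Arrow matrix of order 5: vertex 0 is coupled to every other vertex.
arrow_edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
hub_first = fill_in(5, arrow_edges, [0, 1, 2, 3, 4])
hub_last = fill_in(5, arrow_edges, [1, 2, 3, 4, 0])
```

Eliminating the hub first makes its four neighbours a clique, creating six fill entries; eliminating it last creates none, so the two orderings of the same matrix differ sharply in storage.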

Parallel Process. Lett.

A Clustering Algorithm for Parallel Sparse Cholesky Factorization

Kumar

Eswar

Sadayappan

et al. 1995

This paper presents an integrated approach to two issues relevant to efficient parallel sparse Cholesky factorization: (1) matrix reordering for parallelism and (2) mapping of data to processors. A clustering heuristic is proposed that performs a fill-preserving reordering and maps data onto a fixed number of processors. Performance results on a Cray T3D are presented to demonstrate its effectiveness.

On sparse matrix reordering for parallel factorization

Sadayappan

Huang

1994