The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. Its modification, the BSPRAM model, allows one to combine the advantages of distributed-memory and shared-memory style programming. In this paper we study the BSP memory complexity of matrix multiplication. We propose new memory-efficient BSP algorithms for both standard and fast matrix multiplication. The BSPRAM model is used to simplify the description of the algorithms. The communication and synchronization costs of our algorithms are slightly higher than those of known time-efficient BSP algorithms. The existing time-efficient and the new memory-efficient algorithms are connected by a continuous tradeoff.
Introduction. The model of bulk-synchronous parallel (BSP) computation (see [16], [10], [11], and [13]) provides a simple and practical framework for general-purpose parallel computing. Its main goal is to support the creation of architecture-independent and scalable parallel software. The key features of BSP are the treatment of the communication medium as an abstract fully connected network, and the explicit and independent costing of communication and synchronization.

Originally, BSP was defined as a distributed-memory model with point-to-point communication between the processors. In [15] the BSPRAM model, a variant of BSP based on a mixture of shared and distributed memory, was proposed. Paper [15] also identified some properties of a BSPRAM algorithm that suffice for its optimal simulation in BSP. Algorithms possessing at least one of these properties (communication-obliviousness, high slackness, high granularity) are abundant in scientific and industrial computing.

The efficiency of many parallel applications is constrained by the limited amount of available memory. In this paper we extend the BSPRAM model to account for the memory efficiency of a computation. We present new BSPRAM algorithms for matrix multiplication, considering both the standard method, with sequential time complexity Θ(n^3), and fast Strassen-type methods, with sequential time complexity Θ(n^ω), ω < 3. The new algorithms achieve better memory performance than the McColl-Valiant time-efficient algorithm from [11] and [13] for standard matrix multiplication, or the time-efficient algorithm from [12] for fast matrix multiplication. Communication and synchronization
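As an illustration (not part of the paper's algorithms), the two sequential baselines mentioned above can be sketched in Python: the standard triple-loop method, which performs Θ(n^3) operations, and Strassen's recursion, which replaces 8 recursive block products by 7 and so attains Θ(n^ω) with ω = log2 7 ≈ 2.81. The sketch assumes n is a power of two and uses plain lists of lists.

```python
def mat_mul(A, B):
    """Standard triple-loop multiplication: Theta(n^3) operations."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def split(A):
    """Split an n x n matrix (n even) into four n/2 x n/2 blocks."""
    h = len(A) // 2
    return ([row[:h] for row in A[:h]], [row[h:] for row in A[:h]],
            [row[:h] for row in A[h:]], [row[h:] for row in A[h:]])

def strassen(A, B):
    """Strassen's method: 7 recursive products per level instead of 8,
    giving Theta(n^(log2 7)) operations overall."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22))
    M2 = strassen(mat_add(A21, A22), B11)
    M3 = strassen(A11, mat_sub(B12, B22))
    M4 = strassen(A22, mat_sub(B21, B11))
    M5 = strassen(mat_add(A11, A12), B22)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12))
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22))
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_sub(mat_add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

Note that a naive implementation of Strassen's recursion already hints at the memory issue studied in the paper: the intermediate products M1, ..., M7 require extra workspace at every level of the recursion, which is precisely the kind of cost that a memory-efficient formulation must control.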