1994
DOI: 10.1002/cpe.4330060703
Matrix multiplication on the Intel Touchstone Delta

Abstract: SUMMARY: Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory message-passing architecture with a two-dimensional mesh topology. We analyze and compare three algorithms and obtain an implementation, BiMMeR, that uses communication primitives highly suited to the Delta and exploits the single-node assembly-coded matrix multiplication. Our algorithm…
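As a reading aid only: a minimal sketch of the blocked multiplication primitive the abstract refers to, written in Python/NumPy rather than the paper's assembly-coded node kernel. The function name block_matmul and the block size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def block_matmul(A, B, bs=64):
    """Blocked C = A @ B; each (bs x bs) block product stands in for the
    single-node matrix-multiplication kernel mentioned in the abstract."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                # On the Delta each block product would call the tuned node
                # routine; here plain NumPy is used as a placeholder.
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

# Usage: compare against NumPy's reference result.
A = np.random.rand(256, 192)
B = np.random.rand(192, 320)
assert np.allclose(block_matmul(A, B), A @ B)
```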

Cited by 31 publications (18 citation statements) · References 14 publications
“…However, it is competitive, or faster, and, given its simplicity and flexibility, warrants consideration. Moreover, the implementations by Huss-Lederman et al [7,8] are competitive with PUMMA, and would thus compare similarly with SUMMA. Also, our method is presented in a slightly simplified setting and thus the performance may be slightly better than it would be if we implemented exactly for the cases for which PUMMA and the algorithm by Huss-Lederman et al were designed.…”
Section: Results (mentioning)
confidence: 86%
“…Two recent efforts extend the work by Fox et al to general meshes of nodes: the paper by Choi et al [6] uses a two-dimensional block-wrapped (block-cyclic) data decomposition, while the papers by Huss-Lederman et al [7,8] use a 'virtual' 2-D torus wrap data layout. Both these efforts report very good performance attained on the Intel Touchstone Delta, achieving a sizeable percentage of peak performance.…”
Section: Introduction (mentioning)
confidence: 97%
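The distinction drawn in this citation is between a block-cyclic ("block-wrapped") decomposition and a virtual 2-D torus wrap layout, i.e. two different rules for assigning matrix blocks to a process grid. A minimal sketch of the block-cyclic owner computation, assuming a P x Q grid; block_cyclic_owner is a hypothetical helper, not code from either cited paper.

```python
def block_cyclic_owner(bi, bj, P, Q):
    """Owner (p, q) in a P x Q process grid of global block (bi, bj)
    under a 2-D block-cyclic ("block-wrapped") distribution."""
    return bi % P, bj % Q

# A 6 x 6 grid of blocks mapped onto a 2 x 3 process grid:
for bi in range(6):
    print([block_cyclic_owner(bi, bj, 2, 3) for bj in range(6)])
```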
“…By way of contrast, BiMMeR's BMR algorithm [7] uses a different approach to extend Fox's algorithm to non-square grids. The data layout is flexible and the virtual 2D torus wrap data distribution is recommended [7]. At present, however, BiMMeR's BMR algorithm only deals with square matrices.…”
Section: Parallel Dense Matrix Multiplication Algorithms (mentioning)
confidence: 98%
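For context on the BMR (broadcast-multiply-roll) scheme this citation refers to: below is a serial simulation of Fox's algorithm on a square q x q block grid, the square-matrix case that BiMMeR generalizes. This is a textbook-style sketch under those assumptions, not the BiMMeR implementation; fox_bmr and the block-splitting helper are illustrative names.

```python
import numpy as np

def fox_bmr(A_blocks, B_blocks):
    """Serial simulation of Fox's broadcast-multiply-roll scheme on a q x q
    block grid: at step s, row i broadcasts its A block on diagonal i+s,
    every position multiplies it with its resident B block, then the B
    blocks roll upward by one row."""
    q = len(A_blocks)
    bs = A_blocks[0][0].shape[0]
    C = [[np.zeros((bs, bs)) for _ in range(q)] for _ in range(q)]
    B = [row[:] for row in B_blocks]          # copy; B rows will be rolled
    for step in range(q):
        for i in range(q):
            a = A_blocks[i][(i + step) % q]   # block broadcast along row i
            for j in range(q):
                C[i][j] += a @ B[i][j]
        B = B[1:] + B[:1]                     # roll B block rows upward
    return C

# Usage: split 6x6 matrices into a 3x3 grid of 2x2 blocks and verify.
q, bs = 3, 2
M, N = np.random.rand(q * bs, q * bs), np.random.rand(q * bs, q * bs)
split = lambda X: [[X[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for j in range(q)]
                   for i in range(q)]
assert np.allclose(np.block(fox_bmr(split(M), split(N))), M @ N)
```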
“…All these algorithms use different approaches to extend Fox's algorithm to deal with non-square grids [11]. Specifically, MM5 is a generalized version of BiMMeR's BMR algorithm [7] and can deal with non-square matrices; MM3 and MM4 are completely new algorithms. The third category is the Broadcast-Broadcast approach and a new algorithm, BB, is detailed.…”
Section: Parallel Dense Matrix Multiplication Algorithms (mentioning)
confidence: 99%
“…Despite the loss of interest in systolic arrays per se, systolic principles lead to a series of efficient algorithms for general-purpose computer systems, the prime example being a series of algorithms for matrix multiplication, including: Cannon's [10], Fox's [26], BiMMeR [30], PUMMA [16], SUMMA [58], and DIMMA [15].…”
Section: Introduction (mentioning)
confidence: 99%