1997
DOI: 10.1002/(sici)1096-9128(199704)9:4<255::aid-cpe250>3.0.co;2-2
SUMMA: scalable universal matrix multiplication algorithm

Abstract: In this paper, we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system.
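The paper's MPI implementation is not reproduced on this page; as a rough illustration, the algorithmic core of SUMMA — accumulating C from broadcast column panels of A and row panels of B — can be sketched serially in NumPy (function name and panel width are illustrative, not from the paper):

```python
import numpy as np

def summa_serial(A, B, panel=64):
    """Serial sketch of SUMMA's panel-wise rank-k update formulation.

    In the parallel algorithm, at each step a column panel of A is
    broadcast along process-grid rows and a row panel of B along grid
    columns, and every process applies the local update. Here the same
    sequence of updates is accumulated in a single address space.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for p in range(0, k, panel):
        # One broadcast step of SUMMA: multiply the current panels
        # and accumulate into C.
        C += A[:, p:p + panel] @ B[p:p + panel, :]
    return C
```

The panel width trades the number of broadcast steps against per-step message size; the result is identical to a single matrix product.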

Cited by 367 publications (262 citation statements)
References 14 publications
“…In tensor contractions, the data locality is used such that MPI_Raccumulate is intra-node while MPI_Rget can be inter-node; we made this decision because MPI_Raccumulate is typically not implemented at the hardware level, unlike MPI_Rget and MPI_Rput. The index permutation of tensors is currently performed at the destination; further optimization using a scalable universal matrix multiplication algorithm (SUMMA) 29,30 to avoid the repeated permutation operations will be performed in the future.…”
Section: F Code Generator and Parallelization
confidence: 99%
“…The most widely-used algorithm for parallel matrix multiplication is SUMMA [31], which perfectly load-balances the flops for any matrix dimension, but is only communication-optimal for certain matrix dimensions or if assuming no extra memory. For square matrix multiplication, communication cost lower bounds have been proved [22], [5], [2], suggesting that known 2D algorithms (such as SUMMA) and 3D algorithms [7], [1] are only optimal in certain memory ranges.…”
Section: Introduction
confidence: 99%
“…Tensor contractions are split into redistribution and contraction phases, where the former permutes the dimensions such that the latter can be done by using a matrix-matrix multiplication algorithm such as SUMMA [45]. Because CTF uses a cyclic data decomposition, load imbalance is eliminated, at least for dense contractions.…”
Section: Related Work
confidence: 99%
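The redistribution/contraction split described in the last statement — permute dimensions so a contraction reduces to a matrix-matrix multiply — can be sketched in NumPy (shapes and the function name are illustrative; CTF's actual cyclic decomposition is not modeled here):

```python
import numpy as np

def contract(A, B):
    """Contract C[i,j] = sum over k,l of A[i,k,l] * B[k,l,j].

    The contracted indices (k, l) are grouped into a single matrix
    dimension, after which an ordinary matrix-matrix multiply (SUMMA,
    in the distributed setting) finishes the contraction. In general a
    transpose precedes the reshape; here the contracted indices are
    already adjacent, so no permutation is needed.
    """
    i, k, l = A.shape
    _, _, j = B.shape
    return A.reshape(i, k * l) @ B.reshape(k * l, j)
```

Avoiding the explicit permutation step (as the first citation statement anticipates with SUMMA) matters because the transpose traffic can dominate the cost of small contractions.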