1997
DOI: 10.1002/(sici)1096-9128(199704)9:4<255::aid-cpe250>3.0.co;2-2
SUMMA: scalable universal matrix multiplication algorithm

Abstract: In this paper, we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system.
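The paper's MPI implementation is not reproduced on this page; as a rough illustration, the algorithmic core of SUMMA — accumulating C from broadcast column panels of A and row panels of B — can be sketched serially in NumPy (function name and panel width are illustrative, not from the paper):

```python
import numpy as np

def summa_serial(A, B, panel=64):
    """Serial sketch of SUMMA's panel-wise rank-k update formulation.

    In the parallel algorithm, at each step a column panel of A is
    broadcast along process-grid rows and a row panel of B along grid
    columns, and every process applies the local update. Here the same
    sequence of updates is accumulated in a single address space.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for p in range(0, k, panel):
        # One broadcast step of SUMMA: multiply the current panels
        # and accumulate into C.
        C += A[:, p:p + panel] @ B[p:p + panel, :]
    return C
```

The panel width trades the number of broadcast steps against per-step message size; the result is identical to a single matrix product.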

Cited by 367 publications (262 citation statements)
References 14 publications
“…In tensor contractions, the data locality is used such that MPI_Raccumulate is intra-node while MPI_Rget can be inter-node; we made this decision because MPI_Raccumulate is typically not implemented at the hardware level, unlike MPI_Rget and MPI_Rput. The index permutation of tensors is currently performed at the destination; further optimization using a scalable universal matrix multiplication algorithm (SUMMA) 29,30 to avoid the repeated permutation operations will be performed in the future.…”
Section: F Code Generator and Parallelization
confidence: 99%
“…The most widely-used algorithm for parallel matrix multiplication is SUMMA [31], which perfectly load-balances the flops for any matrix dimension, but is only communication-optimal for certain matrix dimensions or if assuming no extra memory. For square matrix multiplication, communication cost lower bounds have been proved [22], [5], [2], suggesting that known 2D algorithms (such as SUMMA) and 3D algorithms [7], [1] are only optimal in certain memory ranges.…”
Section: Introduction
confidence: 99%
“…Tensor contractions are split into redistribution and contraction phases, where the former permutes the dimensions such that the latter can be done by using a matrix-matrix multiplication algorithm such as SUMMA [45]. Because CTF uses a cyclic data decomposition, load imbalance is eliminated, at least for dense contractions.…”
Section: Related Work
confidence: 99%
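The redistribution/contraction split described in the last statement — permute dimensions so a contraction reduces to a matrix-matrix multiply — can be sketched in NumPy (shapes and the function name are illustrative; CTF's actual cyclic decomposition is not modeled here):

```python
import numpy as np

def contract(A, B):
    """Contract C[i,j] = sum over k,l of A[i,k,l] * B[k,l,j].

    The contracted indices (k, l) are grouped into a single matrix
    dimension, after which an ordinary matrix-matrix multiply (SUMMA,
    in the distributed setting) finishes the contraction. In general a
    transpose precedes the reshape; here the contracted indices are
    already adjacent, so no permutation is needed.
    """
    i, k, l = A.shape
    _, _, j = B.shape
    return A.reshape(i, k * l) @ B.reshape(k * l, j)
```

Avoiding the explicit permutation step (as the first citation statement anticipates with SUMMA) matters because the transpose traffic can dominate the cost of small contractions.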