Abstract. One can use extra memory to parallelize matrix multiplication by storing p^{1/3} redundant copies of the input matrices on p processors in order to do asymptotically less communication than Cannon's algorithm [2], and be faster in practice [1]. We call this algorithm "3D" because it arranges the p processors in a 3D array, and Cannon's algorithm "2D" because it stores a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of "2.5D algorithms". For matrix multiplication, we can take advantage of any amount of extra memory to store c copies of the data, for any c ∈ {1, 2, ..., p^{1/3}}, to reduce the bandwidth cost of Cannon's algorithm by a factor of c^{1/2} and the latency cost by a factor of c^{3/2}. We also show that these costs reach the lower bounds [13, 3], modulo polylog(p) factors. We similarly generalize LU decomposition to 2.5D and 3D, including communication-avoiding pivoting, a stable alternative to partial pivoting [7]. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while c copies of the data can also reduce the bandwidth by a factor of c^{1/2}, the latency must increase by a factor of c^{1/2}, so that the 2D LU algorithm (c = 1) in fact minimizes latency. Preliminary results of 2.5D matrix multiplication on a Cray XT4 machine also demonstrate a performance gain of up to 3X with respect to Cannon's algorithm. Careful choice of c also yields up to a 2.4X speedup over 3D matrix multiplication, due to a better balance between communication costs.
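For concreteness, the stated c^{1/2} and c^{3/2} reductions correspond to the following cost model, a sketch consistent with the abstract's claims (constant factors and lower-order polylog(p) terms omitted):

```latex
% Bandwidth W (words moved) and latency S (messages) for multiplying two
% n-by-n matrices on p processors, with c replicated copies of the data.
% Cannon's 2D algorithm (c = 1): W_{2D} = O(n^2/\sqrt{p}), S_{2D} = O(\sqrt{p}).
\[
  W_{2.5D} = O\!\left(\frac{n^2}{\sqrt{c\,p}}\right) = \frac{W_{2D}}{\sqrt{c}},
  \qquad
  S_{2.5D} = O\!\left(\sqrt{\frac{p}{c^{3}}}\right) = \frac{S_{2D}}{c^{3/2}},
  \qquad
  1 \le c \le p^{1/3}.
\]
```

Setting c = p^{1/3} recovers the 3D algorithm's bandwidth of O(n^2/p^{2/3}), which is why an intermediate choice of c can trade extra memory for a better balance between bandwidth and latency.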
Abstract. Cyclops (cyclic-operations) Tensor Framework (CTF) is a distributed library for tensor contractions. CTF aims to scale high-dimensional tensor contractions, such as those required in the Coupled Cluster (CC) electronic structure method, to massively parallel supercomputers. The framework preserves tensor structure by subdividing tensors cyclically, producing a regular parallel decomposition. An internal virtualization layer provides completely general mapping support while maintaining ideal load balance. The mapping framework decides on the best mapping for each tensor contraction at run-time via explicit calculations of memory usage and communication volume. CTF employs a general redistribution kernel, which transposes tensors of any dimension between arbitrary distributed layouts yet touches each piece of data only once. Sequential symmetric contractions are reduced to matrix multiplication calls via tensor index transpositions and partial unpacking. The user-level interface elegantly expresses arbitrary-dimensional generalized tensor contractions in the form of a domain-specific language. We demonstrate the performance of CC with single and double excitations on Blue Gene/Q and Cray XE6 supercomputers.
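To illustrate the cyclic subdivision idea, here is a minimal serial sketch (the function names and the small dense example are hypothetical, not CTF's interface): each tensor mode is dealt out round-robin over the processor grid, so every processor owns a local piece of identical shape.

```python
import numpy as np

def cyclic_owner(index, proc_grid):
    """Map a tensor element's multi-index to a processor coordinate:
    mode k is dealt round-robin over the proc_grid[k] processors."""
    return tuple(i % q for i, q in zip(index, proc_grid))

def local_piece(tensor, proc_coord, proc_grid):
    """Sub-tensor owned by one processor under a cyclic layout: along
    mode k, elements proc_coord[k], proc_coord[k] + proc_grid[k], ..."""
    slices = tuple(slice(r, None, q) for r, q in zip(proc_coord, proc_grid))
    return tensor[slices]

# Example: a 6x6 matrix on a 2x3 processor grid. Every processor gets a
# 3x2 local piece, so load balance is exact, and a symmetric tensor's
# symmetry pattern repeats identically in every local piece.
A = np.arange(36).reshape(6, 6)
print(local_piece(A, (0, 0), (2, 3)))  # rows 0, 2, 4 and columns 0, 3
```

Unlike a blocked decomposition, the cyclic layout gives every processor a structurally identical sub-tensor, which is consistent with how the abstract describes preserving tensor structure while maintaining ideal load balance.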
Abstract. Sorting is a commonly used process with a wide breadth of applications in high performance computing. Early research in parallel processing has provided us with comprehensive analysis and theory for parallel sorting algorithms. However, modern supercomputers have advanced rapidly in size and changed significantly in architecture, forcing new adaptations to these algorithms. To fully utilize the potential of highly parallel machines, tens of thousands of processors are used. Efficiently scaling parallel sorting on machines of this magnitude is inhibited by the communication-intensive problem of migrating large amounts of data between processors. The challenge is to design a highly scalable sorting algorithm that uses minimal communication, maximizes overlap between computation and communication, and uses memory efficiently. This paper presents a scalable extension of the Histogram Sorting method, making fundamental modifications to the original algorithm in order to minimize message contention and exploit overlap. We implement Histogram Sort, Sample Sort, and Radix Sort in CHARM++ and compare their performance. The choice of algorithm, as well as the importance of the optimizations, is validated by performance tests on two predominant modern supercomputer architectures: the Cray XT4 at ORNL (Jaguar) and the Blue Gene/P at ANL (Intrepid).
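The heart of Histogram Sort is an iterative refinement of splitter candidates driven by a global histogram of key ranks. The serial sketch below is illustrative (the names and the bisection-style refinement are assumptions; in the parallel algorithm the histogram comes from a reduction over per-processor counts):

```python
import numpy as np

def histogram_splitters(keys, p, tol=0.05, max_iters=30):
    """Refine p-1 splitters until each of the p buckets would receive
    roughly len(keys)/p keys, in the spirit of Histogram Sort. A sorted
    copy of the keys stands in for the reduced global histogram."""
    sorted_keys = np.sort(keys)
    n = len(keys)
    ideal = n / p * np.arange(1, p)        # target global rank per splitter
    lo = np.full(p - 1, keys.min(), dtype=float)
    hi = np.full(p - 1, keys.max(), dtype=float)
    splitters = (lo + hi) / 2
    for _ in range(max_iters):
        # "Histogram" step: global count of keys below each candidate.
        ranks = np.searchsorted(sorted_keys, splitters)
        if np.all(np.abs(ranks - ideal) <= tol * n / p):
            break                           # all bucket sizes within tolerance
        # Bisect each splitter's bracket toward its ideal rank.
        too_low = ranks < ideal
        lo = np.where(too_low, splitters, lo)
        hi = np.where(too_low, hi, splitters)
        splitters = (lo + hi) / 2
    return splitters

keys = np.random.rand(100_000)
print(histogram_splitters(keys, p=8))  # roughly 0.125, 0.25, ..., 0.875
```

Because only splitter candidates and counts travel between processors during probing, the expensive all-to-all data migration happens once, after the splitters converge, which is where the contention and overlap optimizations described above apply.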