A massively parallel tensor contraction framework for coupled-cluster computations

Solomonik, Edgar; Matthews, Devin A.; Hammond, Jeff R.; Stanton, John F.; Demmel, James

doi:10.1016/j.jpdc.2014.06.002

Cited by 176 publications

(167 citation statements)

References 42 publications

Citation statement types: Supporting, Mentioning (167), Contrasting


“…Prior analysis [9] shows that a tensor contraction could be defined as a generalized SUMMA and the communication volume is asymptotically the same between CTF and NWChem, while we show that data movement is influenced by the runtime and programming model. SUMMA is also shown to be communication optimal assuming no extra memory is used to apply communication-avoiding optimizations.…”

Section: Data Movement Analysis Through a Proxy Benchmark (mentioning)

confidence: 78%

“…CTF and TiledArray provide a general programming interface making them open to applications outside of electronic structure theory. A notable example of using CTF in quantum chemistry computations is provided by the Aquarius package [9,10].…”

Section: Related Work (mentioning)

confidence: 99%

“…SUMMA is also shown to be communication optimal assuming no extra memory is used to apply communication-avoiding optimizations. Both CTF [9] and NWChem [12] use adaptations of the SUMMA. They apply communication-avoiding techniques: CTF algorithmically applies partial data replication (2.5D algorithm) and global-array implementations could apply runtime block caching.…”

Section: Data Movement Analysis Through a Proxy Benchmark (mentioning)

confidence: 99%

“…They apply communication-avoiding techniques: CTF algorithmically applies partial data replication (2.5D algorithm) and global-array implementations could apply runtime block caching. Considering the data needed to carry out the computation, Solomonik analysis [9] shows that the communication lower bounds are asymptotically the same for both implementations, given an appropriate choice of a tiling size, which influences the amount of data padding and memory used.…”

Section: Data Movement Analysis Through a Proxy Benchmark (mentioning)

confidence: 99%

“…The "owner computes" rule is used, but if data is replicated, then MPI reduction is used to have optimized accumulation of the results. The CTF authors introduced a communication-avoiding technique (2.5D [28,9]) to reduce the communication by redundantly copying the blocks of data to different nodes depending on the availability of memory. Figure 9 shows the volume of communicated data per node for the various implementations for different processes per node.…”

Section: Analytical Estimate Of the Data Movement Of The Proxy Benchmark (mentioning)

confidence: 99%
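The citation statements above refer to SUMMA as the basis of the contraction algorithms in both CTF and NWChem. As a rough illustration only, the sketch below emulates SUMMA serially with NumPy on a hypothetical square process grid. It is not code from CTF, NWChem, or the cited papers: the function name `summa_serial`, the `grid` parameter, and the word-counting convention are assumptions made for this example, and no real message passing takes place.

```python
import numpy as np

def summa_serial(A, B, grid=2):
    """Serially emulate SUMMA on a hypothetical grid x grid process mesh.

    Returns the product C = A @ B and the total number of matrix entries
    that processes would receive through row and column broadcasts.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    assert n % grid == 0 and k % grid == 0 and m % grid == 0

    bn, bk, bm = n // grid, k // grid, m // grid
    # C_blocks[i][j] is the block owned by the (hypothetical) process (i, j).
    C_blocks = [[np.zeros((bn, bm)) for _ in range(grid)] for _ in range(grid)]
    words_received = 0

    for s in range(grid):  # one step per block of the contracted dimension
        for i in range(grid):
            # Block (i, s) of A is broadcast along process row i.
            A_panel = A[i * bn:(i + 1) * bn, s * bk:(s + 1) * bk]
            for j in range(grid):
                # Block (s, j) of B is broadcast along process column j.
                B_panel = B[s * bk:(s + 1) * bk, j * bm:(j + 1) * bm]
                C_blocks[i][j] += A_panel @ B_panel
                # A process receives a panel only if it does not own it.
                if j != s:
                    words_received += A_panel.size
                if i != s:
                    words_received += B_panel.size

    return np.block(C_blocks), words_received


A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0, 32.0).reshape(4, 4)
C, words = summa_serial(A, B, grid=2)
```

Each process only ever updates the block of C it owns, which is the "owner computes" rule mentioned in the last statement. The 2.5D variant described there would additionally replicate blocks across extra process layers when memory allows, which this sketch does not attempt.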


Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

Ibrahim, Epifanovsky, Williams, et al. 2017. Journal of Parallel and Distributed Computing

Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix-matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed-memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the optimized shared-memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30, Cray XC40, and IBM Blue Gene/Q) and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from compute-bound DGEMMs to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load imbalance: tasking and bulk-synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.


Machine‐learning assisted scheduling optimization and its application in quantum chemical calculations

Chen et al. 2023. J Comput Chem

Easy and effective usage of computational resources is crucial for scientific calculations, both from the perspectives of timeliness and economic efficiency. This work proposes a bi-level optimization framework to optimize the computational sequences. Machine-learning (ML) assisted static load-balancing, and different dynamic load-balancing algorithms can be integrated. Consequently, the computational and scheduling engine of the PARAENGINE is developed to invoke optimized quantum chemical (QC) calculations. Illustrated benchmark calculations include high-throughput drug suit, solvent model, P38 protein, and SARS-CoV-2 systems. The results show that the usage rate of given computational resources for high-throughput and large-scale fragmentation QC calculations can primarily profit, and faster accomplishing computational tasks can be expected when employing high-performance computing (HPC) clusters.


Accelerating alternating least squares for tensor decomposition by pairwise perturbation

Solomonik 2022. Numerical Linear Algebra App (self cite)

The alternating least squares (ALS) algorithm for CP and Tucker decomposition is dominated in cost by the tensor contractions necessary to set up the quadratic optimization subproblems. We introduce a novel family of algorithms that uses perturbative corrections to the subproblems rather than recomputing the tensor contractions. This approximation is accurate when the factor matrices are changing little across iterations, which occurs when ALS approaches convergence. We provide a theoretical analysis to bound the approximation error. Our numerical experiments demonstrate that the proposed pairwise perturbation algorithms are easy to control and converge to minima that are as good as ALS. The experimental results show improvements of up to 3.1× with respect to state-of-the-art ALS approaches for various model tensor problems and real datasets.
