Clustered Low-Rank Tensor Format: Introduction and Application to Fast Construction of Hartree–Fock Exchange

Lewis, Cannada Andrew; Calvin, Justus A.; Valeev, Edward F.

doi:10.1021/acs.jctc.6b00884

Cited by 25 publications

(21 citation statements)

References 83 publications

(184 reference statements)


“…Other lower-level programming styles, such as functional-style iteration over tensor blocks to explicit loops over tile indices and direct byte-level access to the data, are also supported to provide experts with the ability to compose arbitrary algorithms over general sparse tensorial data structures. 213 TA has been designed to support efficient execution on modern and future hardware of all scales, from a single multi-core machine to a cluster of multi-core, multi-GPU nodes, to leadership-class supercomputers. To maximize the concurrency and hide latency, which is crucial for alleviating the load imbalance and lower computation-to-communication ratio of the irregular sparse tensor algebra, TA has an asynchronous, dataflow-style core.…”

Section: TiledArray (mentioning)

confidence: 99%

From NWChem to NWChemEx: Evolving with the Computational Chemistry Landscape

et al. 2021

Self Cite


Since the advent of the first computers, chemists have been at the forefront of using computers to understand and solve complex chemical problems. As the hardware and software have evolved, so have the theoretical and computational chemistry methods and algorithms. Parallel computers clearly changed the common computing paradigm in the late 1970s and 80s, and the field has again seen a paradigm shift with the advent of graphical processing units. This review explores the challenges and some of the solutions in transforming software from the terascale to the petascale and now to the upcoming exascale computers. While discussing the field in general, NWChem and its redesign, NWChemEx, will be highlighted as one of the early co-design projects to take advantage of massively parallel computers and emerging software standards to enable large scientific challenges to be tackled.


“…The ABCD term was evaluated using the AO-based formalism [26]. The input tensor T representing its initial state in the coupled-cluster simulation was evaluated in AO basis using the Laplace transform approximation, with the occupied orbitals localized and both occupied and AO basis clustered to group spatially-close orbitals together [29]; the clustering defines tiling of the corresponding index ranges. The CPU-only implementation in MPQC evaluates tensor V on the fly, as needed; due to the lack of publicly-available efficient kernels for direct evaluation of AO integrals on GPUs (such kernels are under development by some of us) the GPU benchmarks used block-sparse V with the actual sparsity pattern determined by the CPU-only code but the tiles filled with random data.…”

Section: Practical Example: Evaluation of the ABCD Coupled-Cluster Tensor Contraction for Molecule C65H132 (mentioning)

confidence: 99%

“…To evaluate the impact of the tiling on performance, we consider three representative tilings of the index ranges. Since the k-means-based clustering algorithm that determines the range tilings is quasirandom [29] and cannot ensure uniform tiling (this would necessarily violate locality in all practical applications), these tilings are generated by specifying the target number of clusters for each index range. Table 1 summarizes the difference between the three different tilings, from the most fine-grained one ($v_1$) to the most coarse-grained one ($v_3$).…”

Section: Practical Example: Evaluation of the ABCD Coupled-Cluster Tensor Contraction for Molecule C65H132 (mentioning)

confidence: 99%
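The tiling step described in the statement above can be illustrated with a small sketch. It assumes each index (for example, an orbital) is represented by a 3D center, and it uses plain Lloyd's k-means with a seeded random initialization to stand in for the quasirandom clustering the authors mention. The function name `kmeans_tiling` and its parameters are hypothetical and are not taken from MPQC or TiledArray.

```python
import random

def kmeans_tiling(centers, n_clusters, seed=0, n_iter=50):
    """Group 3D points into at most n_clusters tiles with Lloyd's k-means.

    Returns a list of tiles, each a list of indices into `centers`.
    Assumption: the random initial centroids are where the quasirandom
    behaviour enters; the cited work may initialize differently.
    """
    rng = random.Random(seed)
    centroids = rng.sample(centers, n_clusters)
    assign = [0] * len(centers)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assign = [
            min(range(n_clusters),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in centers
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [centers[i] for i, a in enumerate(new_assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        converged = new_assign == assign
        assign = new_assign
        if converged:
            break
    tiles = [[i for i, a in enumerate(assign) if a == c] for c in range(n_clusters)]
    return [t for t in tiles if t]  # drop empty clusters

# Two spatially separated groups of centers along the x axis.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0),
          (10.0, 0.0, 0.0), (10.1, 0.0, 0.0), (10.2, 0.0, 0.0)]
tiles = kmeans_tiling(points, n_clusters=2)
tile_sizes = [len(t) for t in tiles]
```

Only the target number of clusters is specified, so tile sizes follow the spatial distribution of the centers and are generally not uniform, which matches the trade-off between locality and uniform tiling noted in the statement.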

Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure

Hérault, Robert, Bosilca, et al. 2021

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Self Cite


Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-to-solution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size.


“…To recover data sparsity in tensors appearing in quantum physics applications we developed the Clustered Low-Rank (CLR) representation [23] that is a general, hierarchy-free compressed tensor format. In this representation, each $M_{IJ}$ block of matrix $M$ is approximated by a low-rank decomposition of the form $M_{IJ} \approx X W^{\dagger}$, where for a given $M_{IJ} \in \mathbb{R}^{m \times n}$ of rank $r$, $X$ is $m \times r$ and $W$ is $n \times r$. $X$ and $W$ were constructed from a rank-revealing QR decomposition [26], $M_{IJ} P = QR$.…”

Section: Clustered Low-Rank Representation (mentioning)

confidence: 99%
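The block factorization quoted above can be sketched with SciPy's column-pivoted QR. This is a minimal illustration and not the authors' implementation: the function name `clr_compress_block` and the relative truncation threshold `tol` are assumptions, since the statement does not say how the rank r is chosen.

```python
import numpy as np
from scipy.linalg import qr

def clr_compress_block(M_IJ, tol=1e-8):
    """Return (X, W) with M_IJ ~= X @ W.conj().T from a column-pivoted QR.

    Assumption: the rank r is taken as the number of diagonal entries of R
    whose magnitude exceeds tol times the largest one.
    """
    # Pivoted QR: M_IJ[:, piv] == Q @ R, i.e. M_IJ P = Q R.
    Q, R, piv = qr(M_IJ, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    r = int(np.sum(diag > tol * diag[0])) if diag.size and diag[0] > 0 else 0
    X = Q[:, :r]                                   # m x r
    W_dagger = np.empty((r, M_IJ.shape[1]), dtype=R.dtype)
    W_dagger[:, piv] = R[:r, :]                    # undo the column permutation
    W = W_dagger.conj().T                          # n x r
    return X, W

# A 40 x 30 block built to have rank 3.
rng = np.random.default_rng(0)
M = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 30))
X, W = clr_compress_block(M)
storage_ratio = (X.size + W.size) / M.size  # fraction of dense storage kept
```

Storing X and W takes r(m + n) numbers instead of mn, so the format pays off only for blocks whose rank is well below min(m, n).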

Scalable task-based algorithm for multiplication of block-rank-sparse matrices

Calvin, Lewis, Valeev 2015

Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms

Self Cite


A task-based formulation of the Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization. Scalability of iterative computation of square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework).

Footnote 1: Related matrix data structures have appeared under many names (matrices with decay, H-matrices, rank-structured matrices, and mosaic skeleton approximation), but no single globally-accepted terminology exists. For the history of these types of matrices see Ref. [37].

arXiv:1509.00309v2 [cs.DC] 9 Oct 2015

…communication costs can be partially or fully hidden by overlapping computation and communication, (b) performance should be less sensitive to topology, latency, and CPU clock variations, (c) fine-grained, task-based parallelism is a proven means to attain high intra-node performance by leveraging massively multicore platforms and hiding the costs of memory hierarchy (e.g., Intel TBB, Cilk), (d) lack of global synchronization allows the overlap of multiple high-level stages of computation (e.g., two or more matrix multiplications contributing to the same expression).

The new formulation was used to implement iterative computation of the square root inverse of a matrix, a prototypical operation in which block ranks of intermediate matrices change dynamically during the iteration. The usual advantages of the task formulation, tolerance of load imbalance and latency, are demonstrated in the regime where matrices approach full rank, by comparison against the state-of-the-art dense MM implementations.
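The abstract names the iterative inverse square root as its prototype operation but does not say which iteration is used. As an illustration only, the sketch below uses the coupled Newton-Schulz iteration, a common choice because it needs nothing but matrix multiplications; whether the paper uses this scheme is an assumption. The sketch works on dense NumPy arrays, not on block-rank-sparse matrices, and the function name is hypothetical.

```python
import numpy as np

def inverse_sqrt_newton_schulz(S, n_iter=50, tol=1e-12):
    """Approximate S^(-1/2) for a symmetric positive definite matrix S.

    Coupled Newton-Schulz iteration (assumed here, not confirmed by the
    source): Y -> A^(1/2), Z -> A^(-1/2) for the scaled matrix A = S / scale.
    """
    n = S.shape[0]
    identity = np.eye(n)
    # Scale by the spectral norm so the eigenvalues of A lie in (0, 1],
    # which guarantees convergence of the iteration.
    scale = np.linalg.norm(S, 2)
    Y = S / scale
    Z = identity.copy()
    for _ in range(n_iter):
        T = 0.5 * (3.0 * identity - Z @ Y)
        Y = Y @ T
        Z = T @ Z
        if np.linalg.norm(identity - Z @ Y, "fro") < tol:
            break
    # Undo the scaling: S^(-1/2) = A^(-1/2) / sqrt(scale).
    return Z / np.sqrt(scale)

# A small, well-conditioned symmetric positive definite test matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
S = B @ B.T + 6.0 * np.eye(6)
S_inv_sqrt = inverse_sqrt_newton_schulz(S)
residual = np.linalg.norm(S_inv_sqrt @ S @ S_inv_sqrt - np.eye(6))
```

Every step of this iteration is a matrix product, which is why an operation like this is a natural benchmark for a matrix multiplication algorithm such as the SUMMA formulation described above.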
