Scaling Correlated Fragment Molecular Orbital Calculations on Summit

Self Cite

The primary focus of GAMESS over the last 5 years has been the development of new high-performance codes that are able to take effective and efficient advantage of the most advanced computer architectures, both CPU and accelerators. These efforts include employing density fitting and fragmentation methods to reduce the high scaling of well-correlated (e.g., coupled-cluster) methods as well as developing novel codes that can take optimal advantage of graphical processing units and other modern accelerators. Because accurate wave functions can be very complex, an important new functionality in GAMESS is the quasi-atomic orbital analysis, an unbiased approach to the understanding of covalent bonds embedded in the wave function. Best practices for the maintenance and distribution of GAMESS are also discussed.

Section: Graphical Processing Units (mentioning)

confidence: 99%

The General Atomic and Molecular Electronic Structure System (GAMESS): Novel Methods on Novel Architectures

Zahariev, Xu, Westheimer et al. 2023

Self Cite

“…There have since been several high-performance GPU accelerated RI-MP2 implementations in various software packages, 77,133 including those by some of the present authors. 80,81 Our implementation, which achieves linear scaling with system size through usage of molecular fragmentation, enabled us to perform RI-MP2 energy calculations using the cc-pVDZ/cc-pVDZ-RIFIT basis sets on over 145 000 atoms within ∼40 min, using ∼27 000 GPUs on the Summit supercomputer at the Oak Ridge National Laboratory. 81 While numerous efficient CPU-based MP2 gradient algorithms and implementations have been developed, in the literature to date, there have only been two attempts to use GPUs to accelerate the MP2 or RI-MP2 gradients.…”

Section: Introduction (mentioning)

confidence: 99%

“…While these methods hold considerable promise, their practical application to large molecules is hindered by the steep $\mathcal{O}(N^5)$ computational scaling of the underlying MP2 calculations. Consequently, there has been tremendous research effort over recent decades on devising faster and more efficient algorithms and software for the evaluation of the MP2 energy and gradients.…”

Section: Introduction (mentioning)

confidence: 99%
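To make the fifth-order cost in the statement above concrete, here is a minimal sketch of the textbook closed-shell RI-MP2 correlation energy in plain Python. It is not the implementation of any of the cited papers. The function name `ri_mp2_energy`, the tensor layout `B[P][i][a]`, and the toy inputs are assumptions made for illustration. Each of the o²v² index combinations needs a sum over the auxiliary index P, and since the occupied, virtual, and auxiliary dimensions all grow with system size, the total work grows with its fifth power.

```python
def ri_mp2_energy(B, eps_occ, eps_virt):
    """Closed-shell RI-MP2 correlation energy (illustrative toy, not production code).

    B[P][i][a] : hypothetical three-index RI tensor (auxiliary, occupied, virtual)
    eps_occ    : occupied orbital energies
    eps_virt   : virtual orbital energies
    """
    n_aux = len(B)
    n_occ, n_virt = len(eps_occ), len(eps_virt)

    def eri(i, a, j, b):
        # RI approximation: (ia|jb) ~ sum_P B[P][i][a] * B[P][j][b]
        return sum(B[P][i][a] * B[P][j][b] for P in range(n_aux))

    energy = 0.0
    # Four orbital loops times the auxiliary sum inside eri():
    # o^2 * v^2 * n_aux operations, i.e. fifth order in system size.
    for i in range(n_occ):
        for j in range(n_occ):
            for a in range(n_virt):
                for b in range(n_virt):
                    iajb = eri(i, a, j, b)
                    ibja = eri(i, b, j, a)
                    denom = eps_occ[i] + eps_occ[j] - eps_virt[a] - eps_virt[b]
                    energy += iajb * (2.0 * iajb - ibja) / denom
    return energy


# Smallest possible case: one auxiliary function, one occupied, one virtual orbital.
# (ia|jb) = 1, numerator = 1 * (2 - 1) = 1, denominator = -1 - 1 - 1 - 1 = -4.
print(ri_mp2_energy([[[1.0]]], [-1.0], [1.0]))  # -0.25
```

Production codes replace these loops with batched matrix multiplications, which is the form that maps well onto GPUs.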

“…Additional efforts were systematically undertaken to reduce the steep computational scaling of MP2 energies and gradients, thereby enabling their application to larger molecular systems. Therefore, numerous lower-order scaling algorithms were developed, offering accurate approximations for both the SS- and OS-MP2 energy components at a substantially reduced computational expense. These methods primarily reduce the scaling order with system size by leveraging the local nature of electronic correlation, employing strategies such as orbital localization, atomic-level truncation and exploitation of sparsity in matrix elements, or molecular fragmentation.…”

Section: Introduction (mentioning)

confidence: 99%

“…62,64,66-73,107 These methods primarily reduce the scaling order with system size by leveraging the local nature of electronic correlation, employing strategies such as orbital localization, 62,64,67,70 atomic-level truncation and exploitation of sparsity in matrix elements, 66,69,70,72,73,75,107 or molecular fragmentation. 68,71,76,77,80,81 Another, less-explored pathway to significantly accelerate these calculations is by redesigning the underpinning algorithms to harness the massively parallel nature of modern computing hardware. A major paradigm shift in this hardware has occurred over the past decade with the widespread adoption of heterogeneous architectures.…”

Section: Introduction (mentioning)

confidence: 99%

High-Performance Multi-GPU Analytic RI-MP2 Energy Gradients

Stocks, Palethorpe, Barca 2024

Self Cite

This article presents a novel algorithm for the calculation of analytic energy gradients from second-order Møller-Plesset perturbation theory within the Resolution-of-the-Identity approximation (RI-MP2), which is designed to achieve high performance on clusters with multiple graphical processing units (GPUs). The algorithm uses GPUs for all major steps of the calculation, including integral generation, formation of all required intermediate tensors, solution of the Z-vector equation and gradient accumulation. The implementation in the EXtreme Scale Electronic Structure System (EXESS) software package includes a tailored, highly efficient, multistream scheduling system to hide CPU-GPU data transfer latencies and allows nodes with 8 A100 GPUs to operate at over 80% of theoretical peak floating-point performance. Comparative performance analysis shows a significant reduction in computational time relative to traditional multicore CPU-based methods, with our approach achieving up to a 95-fold speedup over the single-node performance of established software such as Q-Chem and ORCA. Additionally, we demonstrate that pairing our implementation with the molecular fragmentation framework in EXESS can drastically lower the computational scaling of RI-MP2 gradient calculations from quintic to subquadratic, enabling further substantial savings in runtime while retaining high numerical accuracy in the resulting gradients.
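The following toy cost model sketches why fragmentation lowers the scaling order. It is a deliberately simplified illustration, not the EXESS fragmentation scheme: it assumes equal-sized fragments, counts monomers only (no dimer or trimer corrections), and uses a cost of size⁵ per calculation. The function names and the fragment size of 20 atoms are hypothetical. Under these assumptions the fragmented cost grows linearly with system size; counting fragment pairs or triples within a distance cutoff adds further terms, which is one reason a real scheme can land between linear and quadratic.

```python
def monolithic_cost(n_atoms, order=5):
    """Relative cost of one calculation on the whole system (cost ~ N^order)."""
    return float(n_atoms) ** order


def fragmented_cost(n_atoms, fragment_size, order=5):
    """Relative cost when the system is split into fixed-size fragments.

    Monomer-only toy model: (number of fragments) * (fragment_size ^ order).
    """
    n_fragments = -(-n_atoms // fragment_size)  # ceiling division
    return n_fragments * float(fragment_size) ** order


# Doubling the system size under each model (fragment size of 20 is arbitrary):
mono_ratio = monolithic_cost(2000) / monolithic_cost(1000)
frag_ratio = fragmented_cost(2000, 20) / fragmented_cost(1000, 20)
print(mono_ratio)  # 32.0 (quintic: 2^5)
print(frag_ratio)  # 2.0  (linear in the number of fragments)
```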

An Efficient RI-MP2 Algorithm for Distributed Many-GPU Architectures

Snowdon, Barca 2024

Second-order Møller-Plesset perturbation theory (MP2) using the Resolution of the Identity approximation (RI-MP2) is a widely used method for computing molecular energies beyond the Hartree-Fock mean-field approximation. However, its high computational cost and lack of efficient algorithms for modern supercomputing architectures limit its applicability to large molecules. In this paper, we present the first distributed-memory many-GPU RI-MP2 algorithm explicitly designed to utilize hundreds of GPU accelerators for every step of the computation. Our novel algorithm achieves near-peak performance on GPU-based supercomputers through the development of a distributed memory algorithm for forming RI-MP2 intermediate tensors with zero internode communication, except for a single $\mathcal{O}(N^2)$ asynchronous broadcast, and a distributed memory algorithm for the $\mathcal{O}(N^5)$ energy reduction step, capable of sustaining near-peak performance on clusters with several hundred GPUs. Comparative analysis shows our implementation outperforms state-of-the-art quantum chemistry software by over 3.5 times in speed while achieving an 8-fold reduction in computational power consumption. Benchmarking on the Perlmutter supercomputer, our algorithm achieves 11.8 PFLOP/s (83% of peak performance) performing the RI-MP2 energy calculation on a 314-water cluster with 7850 primary and 30,144 auxiliary basis functions in 4 min on 180 nodes and 720 A100 GPUs. This performance represents a substantial improvement over traditional CPU-based methods, demonstrating significant time-to-solution and power consumption benefits of leveraging modern GPU-accelerated computing environments for quantum chemistry calculations.
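As a quick sanity check on the benchmark figures quoted in this abstract, the short calculation below derives the per-node GPU count, the sustained per-GPU rate, and the peak rate implied by the stated 83%. Only the numbers given in the abstract are used; the function name `summarize_benchmark` and the dictionary keys are hypothetical, and the derived values are approximate because the inputs are rounded.

```python
def summarize_benchmark(sustained_pflops=11.8, fraction_of_peak=0.83,
                        n_nodes=180, n_gpus=720):
    """Derive simple per-GPU figures from the numbers quoted in the abstract."""
    return {
        # GPUs on each node
        "gpus_per_node": n_gpus / n_nodes,
        # Sustained rate per GPU in TFLOP/s (1 PFLOP/s = 1000 TFLOP/s)
        "sustained_tflops_per_gpu": sustained_pflops * 1000.0 / n_gpus,
        # Aggregate peak implied by "83% of peak", in PFLOP/s
        "implied_peak_pflops": sustained_pflops / fraction_of_peak,
    }


for name, value in summarize_benchmark().items():
    print(f"{name}: {value:.2f}")
# gpus_per_node: 4.00
# sustained_tflops_per_gpu: 16.39
# implied_peak_pflops: 14.22
```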