Thread-level parallelization and optimization of NWChem for the Intel MIC architecture

Shan, Hongzhang; Williams, Samuel; Jong, Wibe A. de; Oliker, Leonid

doi:10.1145/2712386.2712391

Cited by 10 publications

(7 citation statements)

References 19 publications


“…The approach was tested in the FMO and Community Earth System Model (CESM) packages. Shan et al. (2014, 2015) used OpenMP task parallelism to HF SCF and CCSD(T) drivers.…”

Section: Related Work (mentioning)

confidence: 99%

An efficient MPI/OpenMP parallelization of the Hartree–Fock–Roothaan method for the first generation of Intel® Xeon Phi™ processor architecture

Mironov, Moskovsky, D'Mello, et al. 2017

The International Journal of High Performance Computing Applica


The Hartree-Fock method in the General Atomic and Molecular Structure System (GAMESS) quantum chemistry package represents one of the most irregular algorithms in computation today. Major steps in the calculation are the irregular computation of electron repulsion integrals and the building of the Fock matrix. These are the central components of the main self-consistent field (SCF) loop, the key hot spot in electronic structure codes. By threading the Message Passing Interface (MPI) ranks in the official release of the GAMESS code, we not only speed up the main SCF loop (4× to 6× for large systems) but also achieve a significant (>2×) reduction in the overall memory footprint. These improvements are a direct consequence of memory access optimizations within the MPI ranks. We benchmark our implementation against the official release of the GAMESS code on the Intel® Xeon Phi™ supercomputer. Scaling numbers are reported on up to 7680 cores on Intel Xeon Phi coprocessors.



“…While in some cases, the addition of OpenMP threads improves performance, neither NWChem nor the Global Arrays toolkit are completely thread safe. The use of OpenMP with NWChem CCSD(T) calculations was shown to improve performance, but extensive changes were required in every routine accessed during the calculations, and even variables within nested loops had to be updated.…”

Section: Introduction (mentioning)

confidence: 99%

Improving efficiency of semi‐direct Møller–Plesset second‐order perturbation methods through oversubscription on multiple nodes

et al. 2019


The purpose of this work is to evaluate the efficacy of oversubscription, at the 1n, 2n, and 3n levels for n physical cores, on semi‐direct MP2 methods within NWChem when using two and three Intel nodes. Semi‐direct MP2 energy and gradient calculations were performed on chemical systems ranging from 824 to 1626 basis functions using the cc‐pVDZ basis set. Wall times for semi‐direct MP2 energies were reduced by as much as 36% using two nodes and 44% using three nodes compared to no oversubscription. Total energy consumed by the CPU and DRAM was also reduced by as much as 12% using two nodes and as much as 20% using three nodes when oversubscribing. MP2 gradient wall times improved by as much as 16% using two nodes and 18% using three nodes compared to execution at the 1n level; however, energy savings were insignificant. Intel performance‐counter data show a strong correlation between total wall time saved and less time spent in the idle state, indicating a more efficient use of the processors when oversubscribing. © 2019 Wiley Periodicals, Inc.


“…Moreover, in [34], the authors discussed the optimization of NWChem for Intel's MIC architecture and highlighted the need for tensor computations of about 200-2,000 matrices from 10 × 10 to 40 × 40 in size. In his dissertation, David Ozog discussed NWChem's Tensor Contraction Engine (TCE) and revealed how strongly it relies on the performance of general matrix-matrix multiplication (GEMM) in the computation of the tensor contraction.…”

Section: Introduction (mentioning)

confidence: 99%

A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations

Haidar, Abdelfattah, Zounon, et al. 2018

IEEE Trans. Parallel Distrib. Syst.


We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups vs. CUBLAS of up to 6× for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU.
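The workload shape mentioned in the citation statement above (a few hundred to a few thousand independent products of matrices between 10 × 10 and 40 × 40) can be illustrated with a minimal pure-Python sketch. It is not the code of either paper: the function names `matmul`, `batched_matmul`, `random_matrix`, and `identity` are hypothetical, the batch of 200 and the random sizes are illustrative choices, and a real implementation would call an optimized batched GEMM rather than Python loops.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = list(zip(*b))  # columns of b, so each entry is one dot product
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def batched_matmul(batch_a, batch_b):
    """Multiply each pair (A_i, B_i); no pair depends on any other pair."""
    if len(batch_a) != len(batch_b):
        raise ValueError("batches must have the same length")
    return [matmul(a, b) for a, b in zip(batch_a, batch_b)]


def random_matrix(rng, n):
    """Return an n x n matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]


def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Illustrative batch: 200 square matrices with sizes between 10 and 40.
rng = random.Random(0)
sizes = [rng.randint(10, 40) for _ in range(200)]
batch_a = [random_matrix(rng, n) for n in sizes]
batch_b = [identity(n) for n in sizes]  # A_i * I = A_i, an easy correctness check
products = batched_matmul(batch_a, batch_b)
```

Because each product in the batch is independent, the loop over the batch is the natural place to distribute work across threads or GPU thread blocks. Each individual product is too small to keep a many-core device busy on its own.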
