Global‐view coefficients: a data management solution for parallel quantum Monte Carlo applications

Niu, Qingpeng; Dinan, James; Tirukkovalur, Sravya; Benali, Anouar; Kim, Jeongnim; Mitáš, Luboš; Wagner, Lucas K.; Sadayappan, P.

doi:10.1002/cpe.3748

Cited by 3 publications

(2 citation statements)

References 25 publications

“…Millions of CPU-hours are often invested in promising large-scale projects. It would be desirable to single out ways to reduce the related CPU costs to a maximum possible extent. This can be accomplished by algorithmic and methodological improvements as well as by simplification of Ψ_T.…”

Section: Introduction (mentioning)

confidence: 99%

Noncovalent Interactions by Fixed-Node Diffusion Monte Carlo: Convergence of Nodes and Energy Differences vs Gaussian Basis-Set Size

Dubecký

2017

J. Chem. Theory Comput.

Convergence of fixed-node (FN) shape and FN diffusion Monte Carlo (FNDMC) interaction energies is studied vs the Gaussian basis set saturation level in HF and CH dimers and one-determinant Slater-Jastrow trial wave functions (Ψ). The tested 25 distinct basis sets obtained by stepwise trimming of aug-VDZ and aug-VTZ bases suggest minimum basis set requirements to achieve reasonable results. A single selected trimmed basis set, about 2 times smaller in size than aug-VTZ, is extensively tested on a set of 12 noncovalent complexes including formic acid dimer, benzene-methane, or coronene-H. The results indicate that equivalent noncovalent FNDMC energy differences are available at costs lower than assumed before. Additional insights from electron density differences and comparison of dimer vs monomer Ψ nodes explain this observation.

“…To conduct cutting edge scientific research on material science, QMCPACK has been deployed on the current generation of leadership supercomputers, Mira (IBM Blue Gene/Q) at Argonne National Laboratory and Titan (AMD Opteron CPUs and NVIDIA Tesla K20 GPU) at Oak Ridge National Laboratory. As the most performance-critical component in QMCPACK, 3D B-splines orbitals have been extensively optimized over the years [13] [14]. The highly optimized routines evaluating B-spline SPOs are implemented in QPX intrinsics [15] on BG/Q, SSE/SSE2 intrinsics on x86 and in CUDA [16] to maximize single-node performance.…”

Section: Related Work (mentioning)

confidence: 99%

Optimization and Parallelization of B-Spline Based Orbital Evaluations in QMC on Multi/Many-Core Shared Memory Processors

Mathuriya, Luo, Benali et al.

2017

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Self Cite

B-spline based orbital representations are widely used in Quantum Monte Carlo (QMC) simulations of solids, historically taking as much as 50% of the total run time. Random accesses to a large four-dimensional array make it challenging to efficiently utilize caches and wide vector units of modern CPUs. We present node-level optimizations of B-spline evaluations on multi/many-core shared memory processors. To increase SIMD efficiency and bandwidth utilization, we first apply data layout transformation from array-of-structures to structure-of-arrays (SoA). Then by blocking SoA objects, we optimize cache reuse and get sustained throughput for a range of problem sizes. We implement efficient nested threading in B-spline orbital evaluation kernels, paving the way towards enabling strong scaling of QMC simulations. These optimizations are portable on four distinct cache-coherent architectures and result in up to 5.6x performance enhancements on Intel® Xeon Phi™ processor 7250P (KNL), 5.7x on Intel® Xeon Phi™ coprocessor 7120P, 10x on an Intel® Xeon® processor E5v4 CPU and 9.5x on BlueGene/Q processor. Our nested threading implementation shows nearly ideal parallel efficiency on KNL up to 16 threads. We employ roofline performance analysis to model the impacts of our optimizations. This work combined with our current efforts of optimizing other QMC kernels, result in greater than 4.5x speedup of miniQMC on KNL.

Efficient Runtime Support for a Partitioned Global Logical Address Space

Larkins, Snyder, Dinan

2018

Proceedings of the 47th International Conference on Parallel Processing
