2015
DOI: 10.1016/j.anucene.2014.08.038

Memory bottlenecks and memory contention in multi-core Monte Carlo transport codes

Abstract: We have extracted a kernel that executes only the most computationally expensive steps of the Monte Carlo particle transport algorithm - the calculation of macroscopic cross sections - in an effort to expose bottlenecks within multi-core, shared memory architectures.
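
The kernel in question isolates the macroscopic cross section lookups that dominate reactor-physics Monte Carlo runtimes. As a rough illustration of what such a lookup involves (this sketch is not the authors' code; the data layout, grid structures, and names are assumptions), the hot path is a per-nuclide binary search into a large energy grid followed by interpolation and accumulation:

```c
/* Illustrative sketch of a macroscopic cross section lookup kernel.
 * NOT the authors' code; data layout, sizes, and names are assumptions
 * chosen only to show where the memory traffic comes from. */
#define N_XS 5  /* e.g. total, elastic, absorption, fission, nu-fission (assumed) */

typedef struct {
    double energy;
    double xs[N_XS];     /* microscopic cross sections at this grid point */
} GridPoint;

typedef struct {
    long       n_points;
    GridPoint *points;   /* energy-ordered table, typically far larger than cache */
} NuclideGrid;

/* Binary search for the grid interval containing energy E:
 * a chain of dependent loads into a large table, hence latency bound. */
static long grid_search(const NuclideGrid *g, double E)
{
    long lo = 0, hi = g->n_points - 1;
    while (hi - lo > 1) {
        long mid = (lo + hi) / 2;
        if (g->points[mid].energy > E) hi = mid;
        else                           lo = mid;
    }
    return lo;
}

/* Macroscopic XS for one material at energy E:
 * sum over nuclides of (number density) x (interpolated microscopic XS). */
void macro_xs(const NuclideGrid *grids, const int *nuclides, const double *densities,
              int n_nuclides, double E, double *sigma_t /* length N_XS */)
{
    for (int k = 0; k < N_XS; k++) sigma_t[k] = 0.0;

    for (int n = 0; n < n_nuclides; n++) {
        const NuclideGrid *g = &grids[nuclides[n]];
        long i = grid_search(g, E);          /* random, latency-bound access */
        const GridPoint *lo = &g->points[i], *hi = &g->points[i + 1];
        double f = (E - lo->energy) / (hi->energy - lo->energy);
        for (int k = 0; k < N_XS; k++)
            sigma_t[k] += densities[n] * (lo->xs[k] + f * (hi->xs[k] - lo->xs[k]));
    }
}
```

The dependent loads inside grid_search, scattered across tables much larger than cache, are what make this kernel a probe of the memory system rather than of the floating-point units.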

Cited by 23 publications (13 citation statements). References 12 publications.
“…Significant speedup of simulation performance can be easily achieved by enabling OpenMP in MC codes, but scaling degradation at high core counts is attributed to a complex mix of hardware and software factors, such as memory bottlenecks (Tramm and Siegel, 2013; Siegel et al., 2014).…”
Section: Related Work
confidence: 99%
“…Recent efforts to parallelize MC codes also include leveraging multi-core architectures (Tramm and Siegel, 2013; Siegel et al., 2014) and graphics processing units (GPUs) (Boyd et al., 2013). Significant speedup of simulation performance can be easily achieved by enabling OpenMP in MC codes, but scaling degradation at high core counts is attributed to a complex mix of hardware and software factors, such as memory bottlenecks (Tramm and Siegel, 2013; Siegel et al., 2014).…”
Section: Related Work
confidence: 99%
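
As these citing authors note, the first step is usually just an OpenMP pragma over independent histories or lookups. A minimal sketch of that pattern (table size, lookup count, and the loop body are placeholders, not any particular code base) shows why speedup is easy to get and why it can stop scaling: every core issues uncorrelated reads into the same shared data.

```c
/* Sketch only: the usual way OpenMP is dropped into a Monte Carlo lookup loop.
 * Table size, lookup count, and the "lookup" body are placeholders; the point
 * is that every thread issues independent random reads into one shared table. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long n_table   = 1L << 23;        /* 64 MB of doubles; assumed larger than last-level cache */
    const long n_lookups = 100L * 1000 * 1000;
    double *table = malloc(n_table * sizeof *table);
    for (long i = 0; i < n_table; i++) table[i] = (double)i;

    double checksum = 0.0;
    double t0 = omp_get_wtime();

    /* Lookups are independent, so enabling OpenMP is essentially one pragma... */
    #pragma omp parallel for reduction(+:checksum)
    for (long i = 0; i < n_lookups; i++) {
        unsigned long s = (unsigned long)i * 2654435761UL;  /* cheap hash as a random index */
        checksum += table[s % n_table];     /* ...but all cores contend for the same
                                               memory system, which is where scaling
                                               degrades at high core counts */
    }

    printf("checksum %.1f in %.2f s\n", checksum, omp_get_wtime() - t0);
    free(table);
    return 0;
}
```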
“…On one hand, Doppler broadening introduces compute-intensive FLOP work between frequent memory loads to mitigate the latency-bound bottleneck mainly induced by the binary search in the pre-tabulated cross section approach [10]. On the other hand, temperature-dependent cross section data are computed on the fly whenever they are requested, which significantly reduces the memory footprint of the program.…”
Section: Benchmark
confidence: 99%
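
The trade-off described here can be pictured as two access patterns. The sketch below is illustrative only: the on-the-fly branch uses a placeholder functional form, not an actual Doppler broadening kernel, and all names and parameters are assumptions.

```c
/* Schematic contrast of the two access patterns discussed above.
 * The "on-the-fly" body is a stand-in (a short series evaluation),
 * NOT a real Doppler broadening kernel; it only illustrates trading
 * extra floating-point work for fewer, smaller table loads. */
#include <math.h>

/* (a) Pre-tabulated: one binary search per lookup into a large,
 *     pre-broadened table -> few FLOPs, many dependent memory loads. */
double xs_pretabulated(const double *energy_grid, const double *xs_grid,
                       long n, double E)
{
    long lo = 0, hi = n - 1;
    while (hi - lo > 1) {                    /* latency-bound dependent loads */
        long mid = (lo + hi) / 2;
        if (energy_grid[mid] > E) hi = mid; else lo = mid;
    }
    double f = (E - energy_grid[lo]) / (energy_grid[lo + 1] - energy_grid[lo]);
    return xs_grid[lo] + f * (xs_grid[lo + 1] - xs_grid[lo]);
}

/* (b) On-the-fly: reconstruct the temperature-dependent value from a much
 *     smaller set of parameters -> many FLOPs per load, small footprint. */
double xs_on_the_fly(const double *coeffs, int n_coeffs, double E, double T)
{
    double x = E / (T + 1.0), acc = 0.0;     /* placeholder functional form */
    for (int k = 0; k < n_coeffs; k++)
        acc += coeffs[k] * exp(-k * x);      /* compute-heavy, cache-resident */
    return acc;
}
```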
“…Parallel algorithms for Monte Carlo methods on distributed memory systems have been a vibrant area of research for many decades, including recent advances in distributed fission banks (Romano and Forget, 2013) and spatial domain decomposition (Horelik et al, 2014). However, studies in on-node parallelism have pointed to some key issues—scaling limitations due to memory contention (Siegel et al, 2014; Tramm and Siegel, 2013) and the difficulty of formulating Monte Carlo approaches with SIMD parallelism (Nelson, 2009).…”
Section: Introduction
confidence: 99%
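
The SIMD difficulty mentioned in this statement stems from the history-based structure of the algorithm. A toy sketch (the physics and names are stand-ins) shows the data-dependent branching and variable trip counts that prevent adjacent particles from being processed in lock-step vector lanes:

```c
/* Sketch of why history-based Monte Carlo resists SIMD: adjacent particles take
 * data-dependent paths of different lengths, so loop iterations diverge and a
 * compiler cannot pack them into vector lanes. The physics here is a toy stand-in. */
typedef struct { double E; unsigned int seed; int alive; } Particle;

static double toy_rand(unsigned int *seed)   /* tiny LCG, illustration only */
{
    *seed = *seed * 1664525u + 1013904223u;
    return (*seed >> 8) * (1.0 / 16777216.0);
}

void transport(Particle *p)
{
    while (p->alive) {                       /* unpredictable trip count per particle */
        double xi = toy_rand(&p->seed);
        if (xi < 0.7) {                      /* "scatter": lose some energy */
            p->E *= 0.5 + 0.5 * toy_rand(&p->seed);
            if (p->E < 1.0e-5) p->alive = 0;
        } else {                             /* "absorb": history ends */
            p->alive = 0;
        }                                    /* divergent branches across would-be lanes */
    }
}
```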