Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver

Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre

doi:10.1051/snamc/201404105

Cited by 4 publications

(8 citation statements)

References 4 publications


“…This last change can divide by up to four the arithmetic intensity of the kernel. This confirms the preliminary results on shared memory systems with INTEL TBB presented in [18] against the DENOVO and PENTRAN code.…”

Section: Comparison With SNAP/PARTISN (supporting)

confidence: 87%

“…However, given that modern supercomputer architectures are becoming more and more heterogeneous (presence of accelerators inside computing nodes) and hybrid (interconnection of several nodes), it may be important to review classical parallel programming models as shown in the paper [17]. In a previous work [18], we have presented the DOMINO neutron transport solver designed for those modern architectures. We have especially showed that: 1) a good data locality dramatically improves arithmetic intensity of the sweep operation, and allows us to efficiently exploit SIMD units available inside current processors; 2) usage of the task-based programming model helped us to parallelize the sweep of DOMINO, by relying on INTEL TBB [19] library that addresses shared memory supercomputing nodes.…”

Section: Related Work (mentioning)

confidence: 99%

“…Here only one kernel is required to update angular fluxes in one quadrant (or octant in 3D) for one cell. It is vectorized over angular directions thanks to the C++ generic library Eigen [27] and has been presented in [18].…”

Section: Related Work (mentioning)

confidence: 99%


3D Cartesian Transport Sweep for Massively Parallel Architectures with PaRSEC

Moustafa, Faverge, Plagne, et al. 2015

2015 IEEE International Parallel and Distributed Processing Symposium

Self Cite


High-fidelity nuclear power plant core simulations require solving the Boltzmann transport equation. In discrete ordinates methods, the most computationally demanding operation of this equation is the sweep operation. Considering the evolution of computer architectures, we propose in this paper, as a first step toward heterogeneous distributed architectures, a hybrid parallel implementation of the sweep operation on top of the generic task-based runtime system PARSEC. Such an implementation targets three nested levels of parallelism: message passing, multi-threading, and vectorization. A theoretical performance model was designed to validate the approach and help tune the multiple parameters involved. The proposed parallel implementation of the sweep achieves a sustained performance of 6.1 Tflop/s, corresponding to 33.9% of the peak performance of the targeted supercomputer. This implementation compares favorably with state-of-the-art solvers such as PARTISN, and it can therefore serve as a building block for a massively parallel version of the neutron transport solver DOMINO developed at EDF.



“…The performance of a shared memory node has also been investigated [17]. Intel Thread Building Block (TBB) tasks were used to maintain a task dependency graph of cells within a wavefront.…”

Section: Related Work (mentioning)

confidence: 99%
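
The task dependency graph mentioned in the statement above can be illustrated with a minimal sketch. It is not the TBB implementation from the cited paper: it is a sequential Python stand-in that assumes a single octant in which each cell waits on its upstream neighbours in -x, -y and -z, and the function name `sweep_order` is hypothetical. A cell becomes ready once its pending-dependency counter reaches zero, which is the rule a task scheduler would apply in parallel.

```python
from collections import deque


def sweep_order(nx, ny, nz):
    """Return a valid sweep order for one octant and each cell's wavefront index.

    Assumption: cell (i, j, k) depends on (i-1, j, k), (i, j-1, k), (i, j, k-1).
    """
    # Number of upstream neighbours each cell is still waiting on.
    pending = {
        (i, j, k): (i > 0) + (j > 0) + (k > 0)
        for i in range(nx)
        for j in range(ny)
        for k in range(nz)
    }
    ready = deque([(0, 0, 0)])  # the corner cell has no upstream dependency
    wave = {(0, 0, 0): 0}
    order = []
    while ready:
        cell = ready.popleft()
        order.append(cell)  # a real solver would update angular fluxes here
        i, j, k = cell
        for nxt in ((i + 1, j, k), (i, j + 1, k), (i, j, k + 1)):
            if nxt in pending:
                pending[nxt] -= 1
                # Wavefront index: one more than the latest upstream cell.
                wave[nxt] = max(wave.get(nxt, 0), wave[cell] + 1)
                if pending[nxt] == 0:
                    ready.append(nxt)  # all upstream fluxes are available
    return order, wave
```

In this sketch every cell with the same wavefront index is independent of the others in that wavefront, so those cells are the ones a task-based runtime could execute concurrently.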

An improved parallelism scheme for deterministic discrete ordinates transport

Deakin, McIntosh–Smith, Martineau, et al. 2016

The International Journal of High Performance Computing Applica


In this paper we demonstrate techniques for increasing the node-level parallelism of a deterministic discrete ordinates neutral particle transport algorithm on a structured mesh to exploit many-core technologies. Transport calculations form a large part of the computational workload of physical simulations and so good performance is vital for the simulations to complete in reasonable time. We will demonstrate our approach utilizing the SNAP mini-app, which gives a simplified implementation of the full transport algorithm but remains similar enough to the real algorithm to act as a useful proxy for research purposes. We present an OpenCL implementation of our improved algorithm which achieves a speedup of up to 2.5× on a many-core GPGPU device compared to a state-of-the-art multi-core node for the transport sweep, and up to 4× compared to the multi-core CPUs in the largest GPU enabled supercomputer; the first time this scale of speedup has been achieved for algorithms of this class. We then discuss ways to express our scheme in OpenMP 4.0 and demonstrate the performance on an Intel Knights Corner Xeon Phi compared to the original


“…The performance of a shared memory node has also been investigated [13]. Intel Thread Building Block (TBB) tasks were used to maintain a task dependency graph of cells within a wavefront.…”

Section: Related Work (mentioning)

confidence: 99%

Expressing Parallelism on Many-Core for Deterministic Discrete Ordinates Transport

Deakin, McIntosh–Smith, Gaudin 2015

2015 IEEE International Conference on Cluster Computing


In this paper we demonstrate techniques for increasing the node-level parallelism of a deterministic discrete ordinates neutral particle transport algorithm on a structured mesh to exploit many-core technologies. Transport calculations form a large part of the computational workload of physical simulations and so good performance is vital for the simulations to complete in reasonable time. We will demonstrate our approach utilizing the SNAP mini-app, which gives a simplified implementation of the full transport algorithm but remains similar enough to the real algorithm to act as a useful proxy for research purposes. We present an OpenCL implementation of our improved algorithm which demonstrates a speedup of up to 2.5× the transport sweep performance on a many-core GPGPU device compared to a state-of-the-art multi-core node; the first time this scale of speedup has been achieved for algorithms of this class.

