Finite element numerical integration for first order approximations on multi- and many-core architectures

Banaś, Krzysztof; Krużel, Filip; Bielański, Jan

doi:10.1016/j.cma.2016.03.038

Cited by 28 publications

(13 citation statements)

References 28 publications


“…Thus, optimizing this kernel to take advantage of current architectures, from the cache hierarchy to the different levels of parallelism, is not straightforward and the scientific literature dealing with this topic is abundant. For instance, optimized implementations on GPU have been described in [2], [3], [4]. Most of these approaches implement mesh coloring strategy and fully benefit from the memory bandwidth available on the underlying architecture.…”

Section: Related Work (mentioning)

confidence: 99%

“…Most of these approaches implement mesh coloring strategy and fully benefit from the memory bandwidth available on the underlying architecture. At the shared-memory level, FEM implementations described in [5], [4], [6] underlines the impact of SIMD instructions and data-reuse at the cache memory level. Additionally, advanced algorithms described in [7] introduced a divide and conquer methodology to build a tree of dependent tasks.…”

Section: Related Work (mentioning)

confidence: 99%
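The mesh coloring strategy mentioned in the two statements above can be illustrated with a minimal sketch. It assumes a mesh given as a list of node-index tuples and uses a simple greedy rule; the function name `color_elements` and the sample mesh are hypothetical and are not taken from any of the cited implementations. Elements that share a node receive different colors, so all elements of one color could be assembled concurrently without conflicting writes to the global matrix.

```python
from collections import defaultdict


def color_elements(elements):
    """Greedy coloring: two elements that share a node never share a color."""
    # Map every node to the elements that touch it.
    node_to_elems = defaultdict(list)
    for e, nodes in enumerate(elements):
        for n in nodes:
            node_to_elems[n].append(e)

    colors = [-1] * len(elements)  # -1 means "not colored yet"
    for e, nodes in enumerate(elements):
        # Colors already taken by neighbours (elements sharing any node).
        used = {
            colors[nb]
            for n in nodes
            for nb in node_to_elems[n]
            if colors[nb] >= 0
        }
        c = 0
        while c in used:
            c += 1
        colors[e] = c
    return colors


# Hypothetical strip of four triangles, each given by its three node indices.
elements = [(0, 1, 3), (1, 4, 3), (1, 2, 4), (2, 5, 4)]
colors = color_elements(elements)

# Group elements by color: each group can be processed in parallel.
groups = defaultdict(list)
for e, c in enumerate(colors):
    groups[c].append(e)
```

A greedy pass like this does not minimize the number of colors; it only guarantees that each color group is free of shared nodes, which is the property the assembly loop relies on.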


Data-Layout Reorganization for an Efficient Intra-Node Assembly of a Spectral Finite-Element Method

Sornet, Jubertie, Dupros, et al. 2018

2018 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)


The Finite-Element Method (FEM) is routinely used to solve Partial Differential Equations (PDE) in various scientific domains. For seismic wave modeling, the Spectral Element Method (SEM), which is a specific formulation of the classical FEM approach, has gained significant attention over the last two decades. This is explained both by the very good numerical accuracy of this method and by the parallel performance of classical MPI-based implementations that scale up to several tens of thousands of computing cores. Nevertheless, the trend for current processors with an increasing level of low-level parallelism requires significant efforts at the shared-memory level. One major bottleneck comes from the standard FEM assembly phase, which leads to a significant amount of irregular memory accesses. This prevents, for instance, any efficient automatic optimization by the compiler. In this paper, we extract a kernel from a spectral-element application dedicated to earthquake simulations in complex geological media (EFISPEC code developed at BRGM, the French Geological Survey). We study the intra-node behavior and we propose different levels of optimization (data-layout, manual vectorization, multithreading) to fully benefit from SIMD units and NUMA architectures. Experiments performed on Intel Broadwell architecture show that the proposed optimizations dramatically improve the intra-node performance of the mini-application. Moreover, our results show a good match with theoretical roofline performance models. We believe that these optimizations are not specific to this mini-application and may be implemented in different SEM and FEM based solvers as well.
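The roofline model referred to in the abstract bounds attainable performance by the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The sketch below only illustrates that formula; the function names and the machine numbers are hypothetical and are not measurements from the paper.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity above which a kernel becomes compute-bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine: 100 GFLOP/s peak, 10 GB/s memory bandwidth.
PEAK, BANDWIDTH = 100.0, 10.0

memory_bound = attainable_gflops(PEAK, BANDWIDTH, 2.0)    # limited by bandwidth
compute_bound = attainable_gflops(PEAK, BANDWIDTH, 50.0)  # limited by peak
```

Under this model, a data-layout change that reduces the bytes moved per floating-point operation raises the arithmetic intensity and therefore the bound for a memory-bound kernel.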


“…free_if(.false.))
!$omp parallel do private(ik, i, k)
do ik = 1, ik_total
  i = (ik - 1)/(kd - 1) + 1
  k = mod(ik - 1, kd - 1) + 1
  call Fluxj_mic(mbc_n, i, k … nested loops with fluxes computing. Therefore, we merge the two loops inside to provide larger data set for vectorization on MIC.…”

Section: Vectorization (mentioning)

confidence: 99%
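The loop merging described in the statement above maps a single collapsed index `ik` back to the pair `(i, k)`. The following sketch checks that mapping against an ordinary nested loop. It assumes that `k` runs from 1 to `kd - 1` and that `ik_total` equals the number of `i` values times `kd - 1`; both are assumptions, since the quoted fragment does not show the original loop bounds. The helper names are hypothetical.

```python
def collapsed_to_pair(ik, kd):
    """Recover 1-based (i, k) from the 1-based collapsed index ik."""
    i = (ik - 1) // (kd - 1) + 1  # integer division, as in the Fortran fragment
    k = (ik - 1) % (kd - 1) + 1   # mod(ik - 1, kd - 1) + 1
    return i, k


def nested_pairs(i_count, kd):
    """Index pairs visited by the original two nested loops (assumed bounds)."""
    return [(i, k) for i in range(1, i_count + 1) for k in range(1, kd)]


def collapsed_pairs(i_count, kd):
    """Index pairs visited by the single merged loop."""
    ik_total = i_count * (kd - 1)  # assumed definition of ik_total
    return [collapsed_to_pair(ik, kd) for ik in range(1, ik_total + 1)]


# Both traversals visit the same pairs in the same order.
same_order = nested_pairs(3, 5) == collapsed_pairs(3, 5)
```

Merging the loops gives the threading runtime one long iteration space to distribute, which is the motivation the statement gives for the transformation.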

“…Wang et al [8] reported the large-scale computation of a high-order CFD code on Tianhe-2 supercomputer that consists of both CPU and MIC coprocessors. And other CFD-related works on Intel MIC architecture can be found in references [9][10][11][12]. Working as coprocessors, GPUs also have been popular in CFD.…”

Section: Introduction (mentioning)

confidence: 99%

Implementation and Optimization of a CFD Solver Using Overlapped Meshes on Multiple MIC Coprocessors

Yuan 2019

Scientific Programming


In this paper, we develop and parallelize a CFD solver that supports overlapped meshes on multiple MIC architectures using multithreading. We optimize the solver through several considerations including vectorization, memory arrangement, and an asynchronous strategy for data exchange on multiple devices. Different vectorization strategies are compared, and the performance of the core functions of the solver is reported. Experiments show that a speedup of about 3.16x can be achieved for the six core functions on a single Intel Xeon Phi 5110P MIC card, and a 5.9x speedup can be achieved using two cards, compared to an Intel E5-2680 processor, for a case with two ONERA M6 wings.


“…Brook et al detailed their early efforts to port and optimize scientific and engineering application codes to the Intel MIC architecture. Banaś et al presented investigations on the performance of the finite element numerical integration algorithm and 3 processor architectures, popular in scientific computing, classical x86_64 CPU, Intel Xeon Phi, and NVIDIA Kepler GPU. Kahale et al explored the NS equation and its solution methodology using multigrid method and Intel Xeon Phi accelerator device.…”

Section: Introduction (mentioning)

confidence: 99%
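As a rough illustration of what finite element numerical integration for first order approximations involves, the sketch below computes the element stiffness matrix of the Laplace operator on one linear triangle. It is a generic textbook example with hypothetical names, not the algorithm studied by Banaś et al. Because the gradients of linear shape functions are constant over the element, a one-point quadrature rule integrates this particular matrix exactly.

```python
# One-point rule on the reference triangle: (point, weight). The weight equals
# the reference triangle's area.
QUADRATURE = [((1.0 / 3.0, 1.0 / 3.0), 0.5)]


def triangle_stiffness(p0, p1, p2):
    """3x3 stiffness matrix of the Laplace operator on a linear triangle."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    # Jacobian determinant of the map from the reference triangle.
    det_j = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    # Gradients of the three linear shape functions (constant on the element).
    grads = [
        ((y1 - y2) / det_j, (x2 - x1) / det_j),
        ((y2 - y0) / det_j, (x0 - x2) / det_j),
        ((y0 - y1) / det_j, (x1 - x0) / det_j),
    ]
    k = [[0.0] * 3 for _ in range(3)]
    # Loop over integration points, then over pairs of shape functions.
    for _point, weight in QUADRATURE:
        scale = weight * abs(det_j)
        for i, gi in enumerate(grads):
            for j, gj in enumerate(grads):
                k[i][j] += scale * (gi[0] * gj[0] + gi[1] * gj[1])
    return k


# Reference triangle with vertices (0,0), (1,0), (0,1).
k_ref = triangle_stiffness((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
```

In a full solver this small kernel runs once per element, which is why its mapping to CPU, Xeon Phi, and GPU hardware is a natural subject for performance studies.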

An approach to enhance the performance of large‐scale structural analysis on CPU‐MIC heterogeneous clusters

Miao, Jin, Ding 2016

Concurrency and Computation


Clusters with the CPU‐MIC heterogeneous architecture are becoming more popular in recent years. However, it is not easy to achieve good performance on such machines. The key challenge has been the asymmetry within clusters, arising from different kinds of execution units as well as different communication latencies. To improve the performance of large‐scale structural analysis on CPU‐MIC heterogeneous clusters, a multi‐layer and multi‐grain collaborative parallel computing approach is proposed in the paper. The proposed method combines the parallel algorithm and the hardware architecture of CPU‐MIC heterogeneous clusters together. Through mapping computing tasks to various hardware layers, it both resolves the load balance problem between CPU and MIC devices and significantly reduces the communication overheads of the system. Numerical experiments conducted on Tianhe‐2 supercomputer show that the proposed method obtained better performance compared with the traditional approach. Scalability investigation showed that the proposed method had good scalability with respect to problem sizes. The findings of this paper are of help to the parallel porting and performance optimization of other applications on CPU‐MIC heterogeneous clusters.
