Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional eulerian code on many core platforms

Idomura, Yasuhiro; Ina, Takuya; Mayumi, Akie; Yamada, S.; Matsumoto, Kazuya; Asahi, Yuuichi; Imamura, Toshiyuki

doi:10.1145/3148226.3148234

Cited by 7 publications

(7 citation statements)

References 25 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…An important competitor for GPUs is the Intel Xeon Phi, which has similar characteristics to GPUs and is also regarded as an accelerator. Encouraging results for Xeon Phi acceleration have also been reported …”

Section: Introductionmentioning

confidence: 76%

Overlapping communications in gyrokinetic codes on accelerator‐based platforms

Asahi

Latu

Bigot

et al. 2019

Concurrency and Computation

View full text Add to dashboard Cite

Communication and computation overlapping techniques have been introduced in the five-dimensional gyrokinetic codes GYSELA and GKV. In order to anticipate some of the exa-scale requirements, these codes were ported to the modern accelerators, Xeon Phi KNL and Tesla P 100 GPU. On accelerators, a serial version of GYSELA on KNL and GKV on GPU are respectively 1.3× and 7.4× faster than those on a single Skylake processor (a single socket). For the scalability, we have measured GYSELA performance on Xeon Phi KNL from 16 to 512 KNLs (1024 to 32k cores) and GKV performance on Tesla P 100 GPU from 32 to 256 GPUs. In their parallel versions, transpose communication in semi-Lagrangian solver in GYSELA or Convolution kernel in GKV turned out to be a main bottleneck. This indicates that in the exa-scale, the network constraints would be critical. In order to mitigate the communication costs, the pipeline and task-based overlapping techniques have been implemented in these codes. The GYSELA 2D advection solver has achieved a 33% to 92% speed up, and the GKV 2D convolution kernel has achieved a factor of 2 speed up with pipelining. The task-based approach gives 11% to 82% performance gain in the derivative computation of the electrostatic potential in GYSELA.We have shown that the pipeline-based approach is applicable with the presence of symmetry, while the task-based approach can be applicable to more general situations. KEYWORDSoverlap, semi-Lagrangian, spectral, Tesla P100 GPU, transpose communication, Xeon Phi KNL INTRODUCTIONIt is known that turbulence in a magnetic confined fusion plasma exhibits strong anisotropies in parallel and perpendicular directions to magnetic fields. In the parallel direction along the magnetic field line, the characteristic scale is the machine size (ie, the order of meter), while in the perpendicular direction, the characteristic size is down to a tiny Larmor radius scale (ie, the order of centimeter). In numerical simulations, higher resolution is needed in the perpendicular direction than in the parallel direction. It is therefore reasonable to parallelize such a simulation in the perpendicular directions since they require a large numbers of grid points. There are several algorithms used in this field, which need a global data structure in the perpendicular directions, like spectral or semi-Lagrangian methods. In order to gather distributed data, we need the so-called transpose communication. However, this type of communication is relatively demanding.So as to mask the communication cost, the pipeline-based computation and communication overlapping method was proposed for 5Dgyrokinetic simulation codes based on the finite difference method 1 and the spectral method. 2,3 An improved strong scaling up to 600 k cores was demonstrated on a conventional CPU-based supercomputer. 1,3 It is, however, still questionable to have a similar scalability on state-of-the art supercomputing systems employing accelerators, where the computational power is enormously increased while the improvement of the ...

show abstract

Section: Introductionmentioning

confidence: 76%

Overlapping communications in gyrokinetic codes on accelerator‐based platforms

Asahi

Latu

Bigot

et al. 2019

Concurrency and Computation

View full text Add to dashboard Cite

show abstract

“…In most of performance studies [4,13,15], CA Krylov methods were applied to well-conditioned problems, where CA-steps are extended for s > 10. However, in Ref.…”

Section: Related Workmentioning

confidence: 99%

“…As for OpenMP parallelization, we use 12 and 68 threads on ICEX and KNL, respectively. On KNL, we choose 68 threads without hyper threading to avoid performance degradation in MPI communications [4]. Although KNL has hierarchical memory architecture consisting of MCDRAM (16 GByte, B∼480 GByte/s) and DDR4 (96 GByte), we suppress the problem size below 16 GB per node and use only MCDRAM in a flat mode.…”

Section: Computing Platformsmentioning

confidence: 99%

“…In order to resolve this issue at mathematics or algorithm levels, in Refs. [3,4], we have introduced communication-avoiding (CA) Krylov methods to a fusion plasma turbulence code GT5D [5] and a multiphase thermal-hydraulic CFD code JUPITER [6].The implicit solver in the GT5D is well-conditioned, and the communicationavoiding general minimum residual (CA-GMRES) method [7] was stable for large CA-steps s > 10. On the other hand, the Poisson solver in the JUPITER is ill-conditioned, and the convergence of the left-preconditioned communicationavoiding conjugate gradient (P-CACG) method [7] was limited for s ≤ 3.…”

mentioning

confidence: 99%

mentioning

confidence: 99%

See 2 more Smart Citations

Application of a Preconditioned Chebyshev Basis Communication-Avoiding Conjugate Gradient Method to a Multiphase Thermal-Hydraulic CFD Code

Idomura

Ina

Mayumi

et al. 2018

Supercomputing Frontiers

View full text Add to dashboard Cite

A preconditioned Chebyshev basis communication-avoiding conjugate gradient method (P-CBCG) is applied to the pressure Poisson equation in a multiphase thermal-hydraulic CFD code JUPITER, and its computational performance and convergence properties are compared against a preconditioned conjugate gradient (P-CG) method and a preconditioned communication-avoiding conjugate gradient (P-CACG) method on the Oakforest-PACS, which consists of 8,208 KNLs. The P-CBCG method reduces the number of collective communications with keeping the robustness of convergence properties. Compared with the P-CACG method, an order of magnitude larger communication-avoiding steps are enabled by the improved robustness. It is shown that the P-CBCG method is 1.38× and 1.17× faster than the P-CG and P-CACG methods at 2,000 processors, respectively. the accelerated computation revealed severe bottlenecks of communication. Krylov solvers involve local halo data communications for stencil computations or sparse matrix vector operations SpMVs, and global data reduction communications for inner product operations in orthogonalization procedures for basis vectors. Although communication overlap techniques [2] may reduce the former latency, it can not be applied to the latter. In order to resolve this issue at mathematics or algorithm levels, in Refs. [3,4], we have introduced communication-avoiding (CA) Krylov methods to a fusion plasma turbulence code GT5D [5] and a multiphase thermal-hydraulic CFD code JUPITER [6].The implicit solver in the GT5D is well-conditioned, and the communicationavoiding general minimum residual (CA-GMRES) method [7] was stable for large CA-steps s > 10. On the other hand, the Poisson solver in the JUPITER is ill-conditioned, and the convergence of the left-preconditioned communicationavoiding conjugate gradient (P-CACG) method [7] was limited for s ≤ 3. Even with s = 3, the strong scaling of the JUPITER on the K-computer [8] was dramatically improved by reducing the number of global data reduction communications to 1/s. However, for practical use, it is difficult to operate CA Krylov solvers at the upper limit of CA-steps, because the Poisson operator is time dependent and its condition number may increase in time. Therefore, we need to use more robust CA Krylov methods at CA-steps well below the upper limit, beyond which they become numerically unstable. In order to resolve this issue, in this work, we introduce the preconditioned Chebyshev basis communicationavoiding conjugate gradient (P-CBCG) method to the JUPITER, and examine its robustness and computational performance on the Oakforest-PACS, which consists of 8,208 KNLs.The reminder of this paper is organized as follows. Related works are reviewed in Sect. 2. In Sect. 3, we explain CA Krylov subspace methods used in this work. In Sect. 4, we discuss numerical properties and kernel performances of CA Krylov solvers. In Sect. 5, we present the convergence property of CA Krylov methods and the computational performances of CA Krylov solvers on the JAEA ICEX and the O...

show abstract