2013
DOI: 10.1177/1094342013490973

Communication-overlap techniques for improved strong scaling of gyrokinetic Eulerian code beyond 100k cores on the K-computer

Abstract: Plasma turbulence research based on five-dimensional (5D) gyrokinetic simulations is one of the most critical and demanding issues in fusion science. To pioneer new physics regimes in both problem size and timescale, improving strong scaling is essential. Overlapping computations and communications using non-blocking MPI communication schemes is a promising approach to improving strong scaling, but it often fails in practical applications with conventional MPI libraries. In this work, this classica…

Cited by 24 publications (6 citation statements)
References 27 publications
“…The transpose communications required for parallel 2D FFTs are implemented by using a simple blocking collective communication, which makes implementation easy and makes it possible to use specific algorithms, such as collective communications optimized for the K computer [11]. Second, transpose communications and FFT computations are overlapped by employing a communication thread in a hybrid parallel model of Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), which enables overlaps of computations with blocking as well as non-blocking communications [12]. Masking of communication costs significantly improves strong scaling of the perpendicular-space parallelization.…”
Section: Introduction (mentioning, confidence: 99%)
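The communication-thread overlap described in the statement above can be illustrated with a small pipelined sketch. It is not GT5D's implementation and rests on several assumptions: Python threads stand in for OpenMP threads, a single-worker `ThreadPoolExecutor` plays the dedicated communication thread, the hypothetical `comm_stage` stands in for a blocking transpose (it sleeps to mimic latency, then transposes a block), and the hypothetical `compute_stage` stands in for the FFT work. While chunk i-1 is being "communicated", chunk i is computed on the calling thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compute_stage(block):
    """Hypothetical stand-in for per-chunk FFT work: scale every element."""
    return [[2 * x for x in row] for row in block]


def comm_stage(block, latency=0.01):
    """Hypothetical stand-in for a blocking transpose communication:
    sleep to mimic network latency, then return the transposed block."""
    time.sleep(latency)
    return [list(col) for col in zip(*block)]


def pipelined_overlap(blocks, latency=0.01):
    """Overlap comm_stage(chunk i-1) on a dedicated communication thread
    with compute_stage(chunk i) on the calling thread."""
    results = []
    # One worker thread plays the role of the dedicated communication thread.
    with ThreadPoolExecutor(max_workers=1) as comm_thread:
        in_flight = None
        for block in blocks:
            computed = compute_stage(block)          # compute chunk i
            if in_flight is not None:
                results.append(in_flight.result())   # wait for comm of chunk i-1
            # Hand chunk i to the communication thread and move on.
            in_flight = comm_thread.submit(comm_stage, computed, latency)
        if in_flight is not None:
            results.append(in_flight.result())       # drain the pipeline tail
    return results


if __name__ == "__main__":
    print(pipelined_overlap([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]))
```

Because `time.sleep` releases Python's global interpreter lock, the simulated communication genuinely proceeds while the next chunk is computed. In an actual MPI+OpenMP code the communication stage would be an MPI call issued by one thread while the others compute, which requires initializing MPI through `MPI_Init_thread` with a sufficient thread-support level (for example `MPI_THREAD_FUNNELED` when only the main thread makes MPI calls).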
“…Since the 1D FFT operations are parallelized over OpenMP threads, we employ thread-safe 1D FFTs from FFTW [14]. The idea of computation-communication overlaps with a communication thread has been tested before [15][16][17]. However, efficiently masking the communication cost with pipelined overlaps requires careful implementation: rearranging multiple computation kernels so that enough computation is available to mask the communications, choosing a proper pipeline length (finer pipelining reduces the unoverlapped parts at the beginning and end of the pipeline but increases MPI latency costs), and regulating the task granularity for thread parallelization (finer granularity reduces load imbalance among OpenMP threads but increases scheduling overheads).…”
Section: Computation-communication Overlaps (mentioning, confidence: 99%)
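The pipeline-length trade-off named in the statement above can be made concrete with a toy cost model. The model is an illustrative assumption, not taken from the cited work: total computation C and communication M are split into n equal chunks, every message adds a fixed latency L, the first compute step and the last communication step are unoverlapped, and each of the n-1 steps in between costs max(c, m). The function names are hypothetical.

```python
def pipeline_time(compute_total, comm_total, latency, n_chunks):
    """Toy two-stage pipeline model (assumption, not from the paper).

    Each chunk costs c = C/n to compute and m = M/n + L to communicate.
    The first compute step and the last communication step are exposed;
    the n - 1 steps in between are overlapped and cost max(c, m) each.
    """
    c = compute_total / n_chunks
    m = comm_total / n_chunks + latency
    return c + (n_chunks - 1) * max(c, m) + m


def best_pipeline_length(compute_total, comm_total, latency, max_chunks=64):
    """Return the chunk count in 1..max_chunks that minimizes the model time."""
    return min(
        range(1, max_chunks + 1),
        key=lambda n: pipeline_time(compute_total, comm_total, latency, n),
    )


if __name__ == "__main__":
    C, M, L = 1.0, 0.8, 0.01  # arbitrary illustrative units
    n = best_pipeline_length(C, M, L)
    print(n, pipeline_time(C, M, L, 1), pipeline_time(C, M, L, n))
```

With C = 1.0, M = 0.8 and L = 0.01, the unpipelined case n = 1 costs 1.81 in this model, while the best length is n = 20 at 1.05. Below that point the exposed tail M/n + L shrinks as n grows. Above it each step becomes communication-bound and the accumulated latency n·L dominates. The model deliberately ignores the task-granularity and OpenMP scheduling effects that the statement also mentions.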
“…The domain decomposition model is implemented using a hybrid parallelization model consisting of multi-layer MPI communicators and multi-core OpenMP parallelization. In addition, a novel computation and communication overlap technique [20] is developed using communication threads, which are implemented with a heterogeneous OpenMP programming model. The strong scaling of GT5D is dramatically improved by this latency hiding technique, and on the K-computer, excellent strong scaling is achieved up to ∼0.6 million cores (see Fig.…”
Section: Calculation Model (mentioning, confidence: 99%)
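One way to picture "multi-layer MPI communicators" is to emulate how `MPI_Comm_split` carves one world communicator into sub-communicators. The sketch below reproduces the documented semantics in pure Python (ranks with equal color share a sub-communicator, ordered by key with ties broken by original rank) so it runs without an MPI installation. The two-level layout in `perp_and_par_comms`, where world rank = par * n_perp + perp, is a hypothetical example and not GT5D's actual decomposition; `MPI_UNDEFINED` colors are not modeled.

```python
from collections import defaultdict


def comm_split(world_size, color_of, key_of):
    """Emulate MPI_Comm_split semantics for ranks 0..world_size-1.

    Ranks with the same color form one sub-communicator; within it they
    are ordered by key, ties broken by original rank. Returns
    {color: [world ranks]}, where list position = new rank.
    """
    groups = defaultdict(list)
    for rank in range(world_size):
        groups[color_of(rank)].append(rank)
    return {
        color: sorted(ranks, key=lambda r: (key_of(r), r))
        for color, ranks in groups.items()
    }


def perp_and_par_comms(n_perp, n_par):
    """Hypothetical two-level layout: world rank = par * n_perp + perp.

    Perpendicular communicators (e.g. for FFT transposes) group ranks that
    share the same 'par' index; parallel communicators group ranks that
    share the same 'perp' index.
    """
    size = n_perp * n_par
    perp = comm_split(size, lambda r: r // n_perp, lambda r: r % n_perp)
    par = comm_split(size, lambda r: r % n_perp, lambda r: r // n_perp)
    return perp, par


if __name__ == "__main__":
    perp, par = perp_and_par_comms(3, 2)
    print(perp)  # {0: [0, 1, 2], 1: [3, 4, 5]}
    print(par)   # {0: [0, 3], 1: [1, 4], 2: [2, 5]}
```

Each world rank belongs to exactly one communicator per layer, so collectives such as the transpose can run independently inside every perpendicular group while other operations use the parallel groups.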