The article proposes a fine-grained all-to-all communication operation that implements flexible data redistribution patterns for irregular applications, such as particle codes. The flexibility is achieved by user-defined distribution functions, which specify how data elements are redistributed among parallel processes on a distributed memory platform. The usage is illustrated for the particle data redistribution step of a grid-based particle code, in which the destination processes of particles are calculated from the particle positions by a specific distribution function. In addition, the proposed fine-grained all-to-all communication operation allows data elements to be duplicated and modified during the redistribution. This functionality is useful for automatically creating ghost particles for the domain decomposition of the particle code during the particle data redistribution step. The interface of the fine-grained all-to-all communication operation is described, and several algorithms for implementing the operation on top of existing MPI operations are presented. Performance results on an IBM Blue Gene/Q platform demonstrate the efficiency of the proposed communication operation with synthetic benchmark data as well as with a parallel particle code.
KEYWORDS: all-to-all communication, data redistribution, distributed memory, message passing, particle simulations
INTRODUCTION

Particle simulation methods are popular approaches for the numerical simulation of complex physical problems. A major computational part of particle codes usually consists of the calculation of the pair-wise interactions between the particles of a given particle system. Long-range interactions, such as Coulomb or gravitational interactions, can contribute significantly to the results even for particles that are far away from each other in the particle system. Thus, all pair-wise interactions need to be considered, which leads to algorithmic and computational challenges, especially for large particle systems. Solver methods for long-range interactions 1 have been developed for a highly scalable parallel library within the ScaFaCoS project. 2 This library includes parallel implementations of tree-based methods, such as the Fast Multipole Method (FMM) 3 or the Barnes-Hut algorithm, 4 as well as grid-based methods, such as Particle-Particle-Particle Mesh (P3M) 5 or fast summations based on nonequispaced fast Fourier transforms (P2NFFT). 6 All these parallel solver methods include a solver-specific distribution of the particle data among the parallel processes executed on a distributed memory platform. Applying such a parallel solver method to a specific particle application code therefore requires data redistribution steps between the particle application code and the parallel solver method of the ScaFaCoS library.

Data redistribution in particle codes has to be performed efficiently such that its runtime is negligible in comparison to the computational costs of the particle interactions. Message passing li...