Looking under the hood of the IBM Blue Gene/Q network

Chen, Dong; Eisley, Noel A.; Heidelberger, Philip; Kumar, Sameer; Mamidala, Amith R.; Petrini, Fabrizio; Senger, Robert M.; Sugawara, Yutaka; Walkup, R. E.; Steinmacher-Burow, Burkhard; Choudhury, Anamitra R.; Sabharwal, Yogish; Singhal, Swati; Parker, Jeff

doi:10.1109/sc.2012.72

Cited by 50 publications

(43 citation statements)

References 13 publications

“…This leads to a minimal total collective latency of 1.7 · 10⁻³ s for the considered case, which is orders of magnitude below the timings measured by Scalasca. MPI introduces some overhead, increasing the latency from 1.8 μs to 5.3 μs for 2 048 nodes and 16 processes per node as reported in [13]. Our measurements show variations of 10-100 in time, e.g., depending on the data type or the usage of MPI_Reduce/MPI_Allreduce.…”

Section: Model Prediction (mentioning)

confidence: 50%

Performance and Scalability of Hierarchical Hybrid Multigrid Solvers for Stokes Systems

Gmeiner, Rüde, Stengel et al. 2015

SIAM J. Sci. Comput.

In many applications involving incompressible fluid flow, the Stokes system plays an important role. Complex flow problems may require extremely fine resolutions, easily resulting in saddle-point problems with more than a trillion (10¹²) unknowns. Even on the most advanced supercomputers, the fast solution of such systems of equations is a highly nontrivial and challenging task. In this work we consider a realization of an iterative saddle-point solver which is based mathematically on the Schur-complement formulation of the pressure and algorithmically on the abstract concept of hierarchical hybrid grids. The design of our fast multigrid solver is guided by an innovative performance analysis for the computational kernels in combination with a quantification of the communication overhead. Excellent node performance and good scalability to almost a million parallel threads are demonstrated on different characteristic types of modern supercomputers.

1. Introduction. Current leading edge supercomputers can provide performance in the order of several petaflop/s, enabling the development of increasingly complex and accurate computational models having unprecedented size. This is especially relevant in flow simulations that may exhibit many small scale features that must be resolved over large domains. As an example, the problem of earth mantle convection is posed on a thick spherical shell of approximately 3 000 km depth and 6 300 km radius, resulting in an overall volume of close to a trillion, that is, 10¹² km³. A high resolution then results automatically in huge algebraic systems.

Although finite element (FE) methods are flexible enough to handle different local mesh-sizes, fully adaptive meshing techniques require dynamic data structures and a complex program control flow that incurs significant computational cost. Recent work on parallel adaptive FE techniques can be found, e.g., in [1,2,11,44]. In [10] it is shown that an adaptive parallel FE method can reach locally 1 km resolution for the mantle convection problem on a large scale supercomputer. Here we will demonstrate that such a resolution can even be reached globally.

Higher order FE approaches can lead to a better accuracy with the same number of unknowns, but the linear systems are denser. This implies more computational work, more memory access cost, and also higher parallel communication cost, so

“…In total, 10 V(3,3)-cycles are executed within seven SCG iterations. The all-to-all bandwidth is bw_comm = 8 · 1.8 GB/s / L_dim, where L_dim is the longest dimension in the torus network [13]. The partitioning for the considered measurement is (4 × 2 × 4 × 4 × 2); hence…”

Section: Communication Performance (mentioning)

confidence: 99%
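
The bandwidth expression in the statement above can be evaluated for the quoted partition. The following is a minimal sketch, not the citing authors' code: the constants 8 and 1.8 GB/s are taken directly from the quote, while the function name and parameter names are hypothetical.

```python
def all_to_all_bandwidth_gbs(partition, rate_gbs=1.8, factor=8):
    """Evaluate bw_comm = factor * rate / L_dim as written in the quoted statement.

    partition: torus extents per dimension, e.g. (4, 2, 4, 4, 2).
    rate_gbs and factor are the 1.8 GB/s and 8 from the quote (names are ours).
    """
    l_dim = max(partition)  # L_dim: the longest dimension of the torus partition
    return factor * rate_gbs / l_dim


partition = (4, 2, 4, 4, 2)
bw = all_to_all_bandwidth_gbs(partition)
print(f"L_dim = {max(partition)}, bw_comm = {bw:.1f} GB/s")
```

For the partition (4 × 2 × 4 × 4 × 2), the longest dimension is 4, so the expression gives 8 · 1.8 / 4 = 3.6 GB/s.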

“…The IBM Blue Gene/Q system [3] is designed to provide high-performance, low power consumption supercomputing. Argonne's Mira system contains 48 racks, 768K cores, and has a theoretical peak performance of 10 petaflops.…”

Section: A. Mira - A Blue Gene/Q Supercomputer (mentioning)

confidence: 99%
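
The Mira figures in the statement above can be cross-checked with simple arithmetic. This is a rough sketch under assumptions that are not stated in the quote: 1,024 nodes per rack, 16 compute cores per node, a 1.6 GHz clock, and 8 double-precision flops per core per cycle. Only the rack count of 48 comes from the quoted text, and the function name is hypothetical.

```python
RACKS = 48                 # from the quoted statement
NODES_PER_RACK = 1024      # assumption: nodes in one Blue Gene/Q rack
CORES_PER_NODE = 16        # assumption: compute cores per node
CLOCK_HZ = 1.6e9           # assumption: core clock frequency
FLOPS_PER_CYCLE = 8        # assumption: 4-wide fused multiply-add per core


def system_totals(racks=RACKS):
    """Return (total cores, theoretical peak in petaflop/s) for the given rack count."""
    cores = racks * NODES_PER_RACK * CORES_PER_NODE
    peak_pflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e15
    return cores, peak_pflops


cores, peak = system_totals()
print(f"{cores} cores = {cores // 1024}K cores, peak ~ {peak:.2f} PFlop/s")
```

Under these assumptions the totals come to 786,432 cores (768K) and roughly 10 petaflop/s, which matches the numbers in the quote.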

Scalable Parallel I/O on a Blue Gene/Q Supercomputer Using Compression, Topology-Aware Data Aggregation, and Subfiling

Bui, Finkel, Vishwanath et al. 2014

2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

In this paper, we propose an approach to improving the I/O performance of an IBM Blue Gene/Q supercomputing system using a novel framework that can be integrated into high performance applications. We take advantage of the system's tremendous computing resources and high interconnection bandwidth among compute nodes to efficiently exploit I/O bandwidth. This approach focuses on lossless data compression, topology-aware data movement, and subfiling. The efficacy of this solution is demonstrated using microbenchmarks and an application-level benchmark.

“…Mesh networks are appealing due to their use in parallel and distributed computing [17,1,10,23]. Mesh networks are cost-effective and provide great performance solutions for diverse applications, simple expansion for future growth, and scalable connection properties.…”

Section: Contributions (mentioning)

confidence: 99%

“…For example, 65,000 nodes of IBM Blue Gene/L are interconnected as a 64 × 32 × 32 3-dimensional mesh or torus [1]. Recently, IBM Blue Gene/Q integrated 5-dimensional torus [10], where a torus is a variation of the mesh.…”

Section: Contributions (mentioning)

confidence: 99%
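
The distinction drawn in the statement above, a torus being a mesh with wraparound links, can be illustrated with a short sketch. It is a generic illustration and not code from any of the cited papers; the helper names `neighbors` and `diameter` are hypothetical.

```python
from math import prod


def neighbors(coord, dims, wrap):
    """Direct neighbors of a node in a mesh (wrap=False) or torus (wrap=True)."""
    found = {}
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            pos = coord[axis] + step
            if wrap:
                pos %= size            # torus: edges wrap around to the other side
            elif not 0 <= pos < size:
                continue               # mesh: no link beyond the boundary
            if pos == coord[axis]:
                continue               # dimension of size 1 has no neighbor
            found[coord[:axis] + (pos,) + coord[axis + 1:]] = None
    return list(found)                 # duplicates (size-2 torus axes) are merged


def diameter(dims, wrap):
    """Maximum hop count between any two nodes."""
    return sum(size // 2 if wrap else size - 1 for size in dims)


dims = (64, 32, 32)
corner = (0, 0, 0)
print("nodes:", prod(dims))
print("corner links, mesh vs torus:",
      len(neighbors(corner, dims, wrap=False)),
      len(neighbors(corner, dims, wrap=True)))
print("diameter, mesh vs torus:",
      diameter(dims, wrap=False), diameter(dims, wrap=True))
```

The 64 × 32 × 32 shape contains 65,536 nodes, which the quote rounds to 65,000. In this sketch a corner node has 3 links in the mesh and 6 in the torus, and the wraparound links cut the diameter from 125 hops to 64.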