Abstract: As researchers continue to architect massive-scale systems, it is becoming clear that these systems will utilize a significant amount of shared hardware between processing units. Systems such as the IBM Blue Gene (BG) and Cray XT have started utilizing flat (i.e., scalable) networks, which differ from switched fabrics in that they use a 3D torus or similar topology. This allows the network to grow only linearly with system scale, instead of the superlinear growth needed for full fat-tree switched topologies, but at the cost of increased network sharing between processing nodes. While in many cases a full fat-tree overestimates the needed bisection bandwidth, it is not clear whether the other extreme of a flat topology is sufficient to move data around the network efficiently. In this paper, we study the network behavior of the IBM BG/P using several application communication kernels, and we monitor network congestion behavior based on detailed hardware counters. Our studies scale from small systems to 8 racks (32,768 cores) of BG/P and provide insights into the network communication characteristics of the system.