2001
DOI: 10.1155/2001/891073

Workload Decomposition Strategies for Shared Memory Parallel Systems with OpenMP

Abstract: A crucial issue in parallel programming (both for distributed- and shared-memory architectures) is work decomposition. The work decomposition task can be accomplished without large programming effort by using high-level parallel programming languages such as OpenMP. However, particular care must still be paid to achieving the performance goals. In this paper we introduce and compare two decomposition strategies, in the framework of shared-memory systems, as applied to a case-study particle-in-cell application. A num…

Cited by 5 publications (6 citation statements)
References 18 publications
“…Race conditions can still occur, however, in the labelling phase, in which each particle is assigned, within a parallel loop over particles, to its interval and labelled with the incremented value of a counter: different threads may try to update the counter of the same interval at the same time. The negative impact of such race conditions on the parallelization efficiency can be contained by avoiding the execution of a complete labelling procedure for all particles at each time step, and instead updating this indexing "by intervals" only for the particles that have changed interval during the last time step [4]. The integration of the inter-node domain-decomposition strategy with the intra-node particle-decomposition one does not present any relevant problem.…”
Section: MPI Implementation of the Inter-node Domain Decomposition
confidence: 99%
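The race described in the quotation arises because several threads may increment the same per-interval counter concurrently. As a purely illustrative, hedged sketch (not the cited paper's code), one way to make that increment thread-safe in OpenMP Fortran is an atomic capture (OpenMP 3.1 and later); all names here (label_particles, count, label, the x/dx interval mapping) are hypothetical:

! Minimal sketch of a thread-safe labelling pass (names are illustrative).
! Each particle l is mapped to an interval j and labelled with the
! incremented value of that interval's counter; "atomic capture"
! protects the read-modify-write of count(j) against concurrent threads.
subroutine label_particles(n_particles, n_intervals, x, dx, count, label)
   implicit none
   integer, intent(in)  :: n_particles, n_intervals
   real,    intent(in)  :: x(n_particles), dx
   integer, intent(out) :: count(n_intervals), label(n_particles)
   integer :: l, j

   count(:) = 0
!$OMP parallel do private(l, j)
   do l = 1, n_particles
      j = min(int(x(l)/dx) + 1, n_intervals)   ! interval owning particle l
!$OMP atomic capture
      count(j) = count(j) + 1
      label(l) = count(j)
!$OMP end atomic
   enddo
!$OMP end parallel do
end subroutine label_particles

Protecting the counter this way removes the correctness problem but still serializes contended increments, which is why the quoted strategy of re-labelling only the particles that changed interval remains attractive.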
“…Such workload decomposition strategies can be applied both to distributed-memory parallel systems [6,5] and to shared-memory ones [4]. They can also be combined, when porting a PIC code to a hierarchical distributed-shared memory system (e.g., a cluster of SMPs), into two-level strategies: a distributed-memory level decomposition (among the n_node computational nodes) and a shared-memory one (among the n_proc processors of each node).…”
Section: Introduction
confidence: 99%
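The two-level combination mentioned above can be pictured as an MPI program (the inter-node, distributed-memory level) whose per-rank particle loop is parallelized with OpenMP (the intra-node, shared-memory level). The skeleton below is only a hedged illustration of that structure; all names and the trivial per-particle update are placeholders, not taken from the paper.

! Schematic two-level decomposition: MPI ranks form the distributed-memory
! level (one rank per node), OpenMP threads the shared-memory level.
program two_level_skeleton
   use mpi
   implicit none
   integer :: ierr, my_node, n_node, l
   integer, parameter :: n_local = 100000   ! particles held by this rank
   real :: x(n_local)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, my_node, ierr)   ! this node's rank
   call MPI_Comm_size(MPI_COMM_WORLD, n_node, ierr)    ! number of nodes

   x = 0.0

   ! Intra-node level: the particles owned by this rank are shared
   ! among the n_proc OpenMP threads of the node.
!$OMP parallel do private(l)
   do l = 1, n_local
      x(l) = x(l) + 1.0      ! placeholder for the per-particle update
   enddo
!$OMP end parallel do

   call MPI_Finalize(ierr)
end program two_level_skeleton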
“…Conversely, the domain decomposition does not entail any memory waste, but it does involve particle migration between different portions of the domain, which causes communication overheads and the need for dynamic load balancing [4,7]. Such workload decomposition strategies can be applied both to distributed-memory parallel systems [7,6] and to shared-memory ones [5]. They can also be combined, when porting a PIC code to a hierarchical distributed-shared memory system (e.g., a cluster of SMPs), into two-level strategies: a distributed-memory level decomposition (among the n_node computational nodes) and a shared-memory one (among the n_proc processors of each node).…”
Section: MPI Implementation of the Inter-node Domain Decomposition
confidence: 99%
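With a domain decomposition, the particle migration mentioned in the quotation has to be handled explicitly: after each push, particles whose new position falls outside the locally owned portion of the domain must be identified, shipped to the owning rank, and removed locally. A hedged sketch of that bookkeeping step, with illustrative names only (the ownership map and the actual MPI exchange are assumptions, not the paper's code):

! Collect the indices of particles that have left this rank's subdomain.
subroutine find_leaving_particles(n_local, my_node, owner, n_out, out_idx)
   implicit none
   integer, intent(in)  :: n_local, my_node
   integer, intent(in)  :: owner(n_local)    ! rank owning each particle's new position
   integer, intent(out) :: n_out, out_idx(n_local)
   integer :: l

   n_out = 0
   do l = 1, n_local
      if (owner(l) /= my_node) then
         n_out = n_out + 1
         out_idx(n_out) = l   ! to be packed and sent to owner(l)
      endif
   enddo
   ! The marked particles are then exchanged (e.g. with MPI point-to-point
   ! calls) and removed from the local arrays; the imbalance this creates
   ! over time is what motivates dynamic load balancing.
end subroutine find_leaving_particles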
“…The relevant portion of the pressure-updating extrinsic procedure described in Section 5 then becomes:

p_par = 0.
!$OMP parallel do private(l,j_r,j_theta,j_phi)
do l = 1,UBOUND(r,dim = 1)
   j_r = f_r(r(l))
   j_theta = f_theta(theta(l))
   j_phi = f_phi(phi(l))
!$OMP critical
   p_par(j_r,j_theta,j_phi,1) = &
      p_par(j_r,j_theta,j_phi,1) &
      + h(r(l),...,w(l))
!$OMP end critical
enddo
!$OMP end parallel do

Unfortunately, the intra-node serialization induced by the critical section protecting the shared access to the array p_par represents a bottleneck that heavily affects performance (almost no speedup) [14]. Such a bottleneck can be eliminated, at the expense of memory occupation, by means of an alternative strategy, analogous to that envisaged within the framework of the inter-node decomposition, which relies on the associative and distributive properties of the updating laws for the pressure array with respect to the contributions given by every single particle: the computation of each update is split among the threads into partial computations, each involving only the contribution of the particles managed by the responsible thread; the partial results are then reduced into global ones.…”
Section: Particle Decomposition Strategy
confidence: 99%
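One possible realization of the replicated-array alternative described in the quotation is to let OpenMP itself perform the per-thread partial accumulation and the final reduction, since Fortran OpenMP allows reduction clauses on arrays. The fragment below mirrors the quoted loop but is only an assumption-laden sketch: the mapping functions f_r/f_theta/f_phi follow the quotation, while the argument list of h (elided in the quotation) is filled in here purely for illustration.

! Critical-section-free alternative: each thread accumulates into a private
! copy of p_par and OpenMP sums the copies at the end of the loop.
! This trades extra memory (one array copy per thread) for the removal of
! the serialization on the shared array.
p_par = 0.
!$OMP parallel do private(l, j_r, j_theta, j_phi) reduction(+:p_par)
do l = 1, UBOUND(r, dim=1)
   j_r     = f_r(r(l))
   j_theta = f_theta(theta(l))
   j_phi   = f_phi(phi(l))
   p_par(j_r, j_theta, j_phi, 1) = p_par(j_r, j_theta, j_phi, 1) &
                                 + h(r(l), theta(l), phi(l), w(l))   ! argument list illustrative
enddo
!$OMP end parallel do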