Proceedings of the 8th International Conference on Supercomputing - ICS '94 1994
DOI: 10.1145/181181.181261

Reducing data communication overhead for DOACROSS loop nests

Abstract: If the iterations of a loop nest cannot be partitioned into independent sets, data communication for the data dependences is inevitable in order to execute the nest on parallel machines. Such loop nests are referred to as Doacross loop nests. This paper is concerned with compiler algorithms for parallelizing Doacross loop nests for distributed-memory multicomputers. We present a method that combines loop tiling, chain-based scheduling and indirect message passing to generate efficient message-passing p…
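To make the definition concrete, here is a minimal illustrative Doacross loop nest; it is our example, not taken from the paper. Each iteration reads values produced by its neighbors in both dimensions, so the iteration space cannot be partitioned into independent sets and any parallel execution must exchange data:

```c
/* Illustrative Doacross loop nest (not from the paper): iteration
 * (i, j) depends on (i-1, j) and (i, j-1), so no partition of the
 * iterations into independent sets exists. */
#define N 1024

static double a[N][N];

void doacross_example(void)
{
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            /* Loop-carried dependences along both dimensions. */
            a[i][j] = 0.5 * (a[i - 1][j] + a[i][j - 1]);
}
```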

Cited by 21 publications (12 citation statements)
References 8 publications
“…In every parallel time step each process performs uninterrupted computation within a single tile and communicates with its n neighbors in order to exchange data. Note that, even if the dependencies of the problem lead to the need for data exchange with diagonal neighbors, one can apply indirect message passing techniques (discussed in [35]), in order to limit the neighboring processes to the n nondiagonal ones. If t_c is the time to compute one iteration, t_s is the communication startup latency, t_t is the time to transmit a unit of data and k is the mapping dimension (i.e.…”
Section: Scheduling, Mapping and Parallel Execution Time
confidence: 99%
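The quoted statement breaks off before the resulting expression. As a hedged sketch (our assumption, not a formula quoted from the citing paper), a per-step cost model using the symbols above would charge one tile's computation plus a startup and transmission term for each of the n non-diagonal neighbors; g (iterations per tile) and w (data units sent per neighbor) are symbols we introduce for illustration:

```latex
% Hedged per-step cost sketch with the quoted symbols t_c, t_s, t_t;
% g and w are assumed here, and the mapping dimension k would enter
% through the number of parallel steps, which the quote truncates.
T_{\text{step}} \;\approx\; g\,t_c \;+\; n\left(t_s + w\,t_t\right)
```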
“…Consequently, only neighboring processes need to communicate assuming reasonably coarse parallel granularities, taking into account that distributed memory architectures are addressed. According to the above, we only consider unitary process communication directions for our analysis, since all other non-unitary process dependencies can be satisfied according to indirect message passing techniques, such as the ones described in [19]. However, in order to preserve the communication pattern of the application, we consider a weight factor d_i for each process dependence direction i, implying that if iteration j = (j_1, …”
Section: Algorithmic Model
confidence: 99%
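Several of these statements lean on the indirect message passing idea, so a minimal MPI sketch of the pattern may help. It is our illustration, not code from [19] or from the cited paper; the rank arithmetic, tags, and function names are assumptions. Data bound for a diagonal neighbor travels via a non-diagonal neighbor in two unitary hops, so no process ever opens a diagonal channel:

```c
/* Hedged sketch of indirect message passing on a 2-D process grid:
 * data for the diagonal neighbor (+1, +1) is routed through the
 * row neighbor (0, +1), so only unitary directions carry messages.
 * Rank layout, tags, and names are illustrative assumptions. */
#include <mpi.h>

void send_diagonal_indirect(double *diag_data, int count,
                            int my_row, int my_col, int cols,
                            MPI_Comm comm)
{
    int right = my_row * cols + (my_col + 1);   /* (0, +1) neighbor */
    /* Step 1: hand the diagonal payload to the row neighbor. */
    MPI_Send(diag_data, count, MPI_DOUBLE, right, /*tag=*/1, comm);
}

/* Step 2, executed by that intermediate neighbor: receive the payload
 * and forward it downward, completing the (+1, +1) route. */
void forward_diagonal(double *buf, int count,
                      int my_row, int my_col, int cols,
                      MPI_Comm comm)
{
    int left  = my_row * cols + (my_col - 1);
    int below = (my_row + 1) * cols + my_col;   /* (+1, 0) neighbor */
    MPI_Recv(buf, count, MPI_DOUBLE, left, /*tag=*/1, comm,
             MPI_STATUS_IGNORE);
    MPI_Send(buf, count, MPI_DOUBLE, below, /*tag=*/2, comm);
}
```

In practice the forwarded payload is typically packed into the message the intermediate process already sends in that direction, which is what makes the indirection nearly free.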
“…All message passing communication is performed outside of the parallel region (lines 5-9 and 20-23), while the multi-threading parallel computation occurs in lines 10-19. Note that no explicit barrier is required for thread synchronization, as this effect is implicitly achieved by exiting the multi-threading parallel region.…”
Section: Fine-grain Hybrid Model
confidence: 99%
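The structure this statement describes (message passing strictly outside the threaded region, with the region's implicit exit barrier standing in for explicit thread synchronization) is the classic fine-grain hybrid pattern. The sketch below is our own illustration of it, with hypothetical names (tile, halo, prev, next); it does not reproduce the cited code or its line numbering:

```c
/* Hedged sketch of the fine-grain hybrid pattern described above:
 * MPI calls sit outside the OpenMP region; leaving the region acts
 * as the thread barrier. Names (tile, halo, ...) are hypothetical. */
#include <mpi.h>
#include <omp.h>

void hybrid_time_step(double *tile, double *halo, int n,
                      int prev, int next, MPI_Comm comm)
{
    /* Message passing before the parallel region. */
    MPI_Sendrecv(tile, n, MPI_DOUBLE, next, 0,
                 halo, n, MPI_DOUBLE, prev, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Multi-threaded computation; the implicit barrier at the end of
     * the parallel region synchronizes threads, so no explicit
     * barrier is needed before the next communication phase. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        tile[i] += 0.5 * halo[i];

    /* Message passing after the parallel region would follow here. */
}
```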
“…i, 1 ≤ j ≤ m}, and implementing the indirect message passing techniques discussed in [16]. The goal of this paper is to determine a rectangular tiling transformation that minimizes the communication volume of a typical, non-boundary process during the parallel execution of the tiled iteration space.…”
Section: Definition Of the Problem
confidence: 99%
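The quote stops just short of the objective function. As a hedged sketch (our assumption, not a formula from the paper), the communication volume of a non-boundary process under a rectangular tiling with edge sizes x_1, …, x_n is commonly written as a weighted sum of the tile's facets, one per unitary communication direction:

```latex
% Hedged sketch: each unitary direction i with weight d_i contributes
% the facet of the tile perpendicular to it.
V(x_1,\dots,x_n) \;=\; \sum_{i=1}^{n} d_i \prod_{j \neq i} x_j
```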