In this paper we present a framework for partitioning data parallel computations across a heterogeneous metasystem at runtime. The framework is guided by program and resource information which is made available to the system. Three difficult problems are handled by the framework: processor selection, task placement and heterogeneous data domain decomposition. Solving each of these problems contributes to reduced elapsed time. In particular, processor selection determines the best grain size at which to run the computation, task placement reduces communication cost, and data domain decomposition achieves processor load balance. We present results which indicate that excellent performance is achievable using the framework. The paper extends our earlier work on partitioning data parallel computations across a single-level network of heterogeneous workstations.
INTRODUCTION

A great deal of recent interest has been sparked within academic, industrial and government circles in the emerging technology of metasystem-based high-performance computing. A metasystem is a shared ensemble of workstations, vector and parallel machines connected by local- and wide-area networks (see Figure 1). The promise of on-line gigabit networks coupled with the tremendous computing power of the metasystem makes it very attractive for parallel computations.

The potentially large array of heterogeneous resources in the metasystem offers an opportunity for delivering high performance on a range of parallel computations. Choosing the best set of available resources is a difficult problem and is the subject of this paper. Consider the set of machines in Table 1 and observe that they have different computation and communication capacities. Loosely coupled parallel computations with infrequent communication would probably benefit by applying the fastest set of computational resources (perhaps the DEC Alpha cluster), and may benefit from distribution across many machines. On the other hand, more tightly coupled parallel computations are best suited to machines that have a higher communication capacity (perhaps an Intel Paragon), but may also benefit from distribution across many machines if the computation granularity is sufficient. We address the latter problem in this paper.

We present a framework that automates partitioning and placement of data parallel computations across metasystems such as the one in Figure 1. Partitioning is performed at runtime, when the state of the metasystem resources is known. Three difficult problems are handled by the framework: processor selection, task placement and heterogeneous data domain decomposition. Solving each of these problems contributes to reduced completion time. Processor selection chooses the best number and type of processors to apply to the computation.
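To make the load-balancing idea behind heterogeneous data domain decomposition concrete, the following is a minimal illustrative sketch (not the paper's actual algorithm, which also accounts for communication cost and topology). It splits a one-dimensional data domain across processors in proportion to hypothetical relative speed ratings, so that faster machines receive proportionally more work and all finish in roughly equal time:

```python
def decompose(n_rows, speeds):
    """Assign n_rows of a data domain to processors in proportion to
    their relative speeds. `speeds` is a hypothetical per-processor
    rating (e.g. rows computed per second)."""
    total = sum(speeds)
    # Ideal fractional share for each processor.
    shares = [n_rows * s / total for s in speeds]
    sizes = [int(x) for x in shares]
    # Hand leftover rows to the processors with the largest remainders.
    leftover = n_rows - sum(sizes)
    order = sorted(range(len(speeds)),
                   key=lambda i: shares[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

# A machine rated 4x gets twice the rows of a machine rated 2x.
print(decompose(100, [4, 2, 2]))  # → [50, 25, 25]
```

An equal (homogeneous) split of the same domain would give the slow machines as much work as the fast one, leaving the fast machine idle while the others finish; the proportional split avoids that imbalance.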
[Notation list, partially recovered: the i-th network cluster; the i-th processor cluster; application communication topology; message size in bytes; communication cost coefficients; processor-dependent communication function; router cost constants; coercion cost constant; number of messages; ...]