This paper examines the issues surrounding efficient execution in heterogeneous grid environments. The performance of a Linux cluster and a parallel supercomputer is first compared using both benchmarks and an application. With an understanding of how benchmark and application performance is affected by processor and interconnect speed, a comparison is made with the bandwidths and latencies available in a grid testbed. Of significant concern is the fact that the available communication bandwidths and latencies span a dynamic range of 3 to 4 orders of magnitude, while processor speeds span only about one-half order of magnitude. Moreover, although both processor speed and network bandwidth are increasing very rapidly, simple propagation delay will become more significant in the network latencies seen by many grid applications. That is to say, the pipes in a grid will be getting fatter but not commensurately shorter. How are we to utilize such an infrastructure effectively? Clearly, an attractive approach is to require sufficient concurrency in the application such that a coarse-grain, data-driven model of execution can be used to hide latencies while hopefully keeping context-switching overheads low. If the "spatial component" of an application is understood, then runtime systems could also apply established techniques such as caching, compression, estimation, and speculative pre-fetching. Ideally, this low-level performance management should be encapsulated in an easy-to-use abstraction.
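As a rough illustration of the coarse-grain, data-driven execution style suggested above, the following Python sketch (all names and parameters, such as fetch_unit, LATENCY_S, and IN_FLIGHT, are hypothetical and not taken from the paper) keeps many independent work units in flight so that wide-area fetch latency is overlapped with computation on units that have already arrived. It is a minimal sketch under these assumptions, not the system described in the paper.

```python
# Hypothetical sketch of a coarse-grain, data-driven execution model:
# computation is driven by the arrival of data units, so high network
# latency is hidden as long as enough independent units are in flight.
import concurrent.futures
import time

LATENCY_S = 0.05   # assumed wide-area latency per fetch (illustrative only)
NUM_UNITS = 64     # number of independent coarse-grain work units
IN_FLIGHT = 16     # at most this many fetches are active concurrently

def fetch_unit(i):
    """Simulate fetching one coarse-grain data unit over a high-latency link."""
    time.sleep(LATENCY_S)          # stand-in for wide-area transfer time
    return list(range(i, i + 1000))

def compute(unit):
    """Local computation on one unit; runs only once its data has arrived."""
    return sum(unit)

def data_driven_run():
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=IN_FLIGHT) as pool:
        # Submit all units; the pool keeps at most IN_FLIGHT fetches active,
        # which acts as a simple speculative pre-fetch window.
        pending = [pool.submit(fetch_unit, i) for i in range(NUM_UNITS)]
        # Consume units in completion order rather than program order:
        # computation proceeds on whatever data is ready while the
        # remaining fetches are still in the pipe.
        for fut in concurrent.futures.as_completed(pending):
            results.append(compute(fut.result()))
    return results

if __name__ == "__main__":
    start = time.time()
    out = data_driven_run()
    print(f"{len(out)} units processed in {time.time() - start:.2f}s "
          f"(strictly serial fetching alone would take "
          f"{NUM_UNITS * LATENCY_S:.2f}s)")
```

In this sketch the "in flight" window plays the role of the application-level concurrency the text calls for: as long as the window is deep enough relative to the latency, the processor is rarely idle waiting on the network, and the mechanism could be hidden behind a higher-level abstraction as the paragraph suggests.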