2004
DOI: 10.1007/978-3-540-24644-2_13
Evaluating the Impact of Programming Language Features on the Performance of Parallel Applications on Cluster Architectures

Abstract: We evaluate the impact of programming language features on the performance of parallel applications on modern parallel architectures, particularly for the demanding case of sparse integer codes. We compare a number of programming languages (Pthreads, OpenMP, MPI, UPC) on both shared- and distributed-memory architectures. We find that language features can make parallel programs easier to write, but cannot hide the underlying communication costs for the target parallel architecture. Powerful compiler an…

Cited by 14 publications (13 citation statements)
References 6 publications
“…Poor performance on distributed memory is consistent with previous UPC evaluation [11]. Program designs that assume efficient access of shared variables do not scale well in systems with higher latency.…”
Section: Parallel Performance on Distributed Memory (supporting)
confidence: 80%
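The scaling problem this statement describes can be illustrated with a simple cost model (an assumption for illustration, not taken from the paper or the citing work): under an alpha-beta model, moving data as many small messages pays the per-message latency once per access, so fine-grained shared-variable access degrades linearly with network latency, while a bulk transfer pays it once.

```python
# Illustrative alpha-beta communication cost model (an assumption,
# not code from the cited papers): time to move n_words as
# n_messages messages is T = n_messages * alpha + n_words * beta,
# where alpha is per-message latency and beta is per-word cost.

def transfer_time(n_words, n_messages, alpha, beta):
    """Total transfer time: latency term plus bandwidth term."""
    return n_messages * alpha + n_words * beta

n = 10_000        # words to communicate
beta = 0.001      # per-word cost (bandwidth term), arbitrary units

# Compare a low-latency shared-memory system with a high-latency cluster.
for alpha in (0.1, 10.0):
    fine = transfer_time(n, n, alpha, beta)  # one word per message
    bulk = transfer_time(n, 1, alpha, beta)  # one aggregated message
    print(f"alpha={alpha:5.1f}: fine-grained={fine:9.1f}, bulk={bulk:7.1f}")
```

As latency alpha grows, the fine-grained cost grows by a factor of n while the bulk cost barely moves, which is consistent with programs built on word-at-a-time shared access scaling poorly on higher-latency systems.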
“…A previous study on the comparison of OpenMP, MPI, and Pthreads [16] focused on performance for sparse integer codes with irregular remote memory accesses. Other recent papers [17,18] conduct a comparison of OpenMP versus MPI on a specific architecture, the IBM SP3 NH2, for a set of NAS benchmark applications (FT, CG, MG).…”
Section: Related Work (mentioning)
confidence: 99%
“…The FT benchmark is designed to aggressively overlap communication with computation [3], and FT-pencils is a variant of the benchmark that issues smaller messages for better overlap. The implementation of CG is described in [4], and gups is a version of the HPCS RandomAccess benchmark that uses bulk communication. Cfd is an application that solves the time dependent Euler equations for computational fluid flow in a rectangular computational domain, with the high level data structures and algorithms implemented in UPC.…”
Section: Effectiveness of Communication Aggregation (mentioning)
confidence: 99%
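The "bulk communication" idea behind the RandomAccess variant mentioned above can be sketched as message coalescing: rather than issuing one network message per remote update, updates are buffered per destination and flushed as a single bulk message. The class below is a minimal sketch of that pattern (names and thresholds are hypothetical, not from the benchmarks' actual code), counting network messages instead of performing real communication.

```python
# Minimal message-coalescing sketch (illustrative; not the actual
# benchmark implementation). Updates destined for the same process
# are buffered and sent together as one bulk message.

class Coalescer:
    def __init__(self, n_dests, flush_threshold=4):
        self.buffers = {d: [] for d in range(n_dests)}
        self.threshold = flush_threshold
        self.messages_sent = 0  # network messages issued so far

    def update(self, dest, payload):
        """Buffer one remote update; flush when the buffer fills."""
        buf = self.buffers[dest]
        buf.append(payload)
        if len(buf) >= self.threshold:
            self.flush(dest)

    def flush(self, dest):
        """Send all buffered updates for dest as one bulk message."""
        if self.buffers[dest]:
            self.messages_sent += 1
            self.buffers[dest].clear()

    def flush_all(self):
        for d in self.buffers:
            self.flush(d)

c = Coalescer(n_dests=2, flush_threshold=4)
for i in range(10):
    c.update(i % 2, i)   # 10 updates alternating between 2 destinations
c.flush_all()
print(c.messages_sent)   # far fewer messages than 10 eager sends
```

With 10 updates split across 2 destinations and a threshold of 4, only 4 bulk messages are sent instead of 10 eager ones; the trade-off is that coalescing delays delivery, which is why it pairs naturally with overlapping communication and computation.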