A Microbenchmark Suite for OpenMP Tasks

Bull, J. Mark; Reid, Fiona; McDonnell, Nicola

doi:10.1007/978-3-642-30961-8_24

Cited by 52 publications

(17 citation statements)

References 3 publications

“…Finally we conclude that the above effects, particularly the lost parallelism with colored execution, combined with threading overheads [24], [25] may account for the slightly worse performance of the hybrid MPI+OpenMP approach on a single node compared to the MPI only runtime. Indeed, many PDE codes written by threading experts [26] are found to run faster with flat MPI.…”

Section: OpenMP Execution (mentioning)

confidence: 99%

Acceleration of a Full-Scale Industrial CFD Application with OP2

Reguly

Mudalige

Bertolli

et al. 2016

IEEE Trans. Parallel Distrib. Syst.

Hydra is a full-scale industrial CFD application used for the design of turbomachinery at Rolls Royce plc. It consists of over 300 parallel loops with a code base exceeding 50K lines and is capable of performing complex simulations over highly detailed unstructured mesh geometries. Unlike simpler structured-mesh applications, which feature high speed-ups when accelerated by modern processor architectures, such as multi-core and many-core processor systems, Hydra presents major challenges in data organization and movement that need to be overcome for continued high performance on emerging platforms. We present research in achieving this goal through the OP2 domain-specific high-level framework. OP2 targets the domain of unstructured mesh problems and follows the design of an active library using source-to-source translation and compilation to generate multiple parallel implementations from a single high-level application source for execution on a range of back-end hardware platforms. We chart the conversion of Hydra from its original hand-tuned production version to one that utilizes OP2, and map out the key difficulties encountered in the process. To our knowledge this research presents the first application of such a high-level framework to a full scale production code. Specifically we show (1) how different parallel implementations can be achieved with an active library framework, even for a highly complicated industrial application such as Hydra, and (2) how different optimizations targeting contrasting parallel architectures can be applied to the whole application, seamlessly, reducing developer effort and increasing code longevity. Performance results demonstrate that not only the same runtime performance as that of the hand-tuned original production code could be achieved, but it can be significantly improved on conventional processor systems. Additionally, we achieve further acceleration by exploiting many-core parallelism, particularly on GPU systems. 
Our results provide evidence of how high-level frameworks such as OP2 enable portability across a wide range of contrasting platforms and their significant utility in achieving near-optimal performance without the intervention of the application programmer.

“…Graph coloring color growth bounded graphs such as unit disk graphs; for resolving resource conflicts. Our study of existing benchmark suites [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] has found that none of them meet majority of the aforementioned key requirements. Our goal is to design a benchmark suite that meets all our stated requirements.…”

Section: Problem (mentioning)

confidence: 99%

IMSuite: A benchmark suite for simulating distributed algorithms

Gupta

Nandivada

2015

Journal of Parallel and Distributed Computing

Considering the diverse nature of real-world distributed applications that makes it hard to identify a representative subset of distributed benchmarks, we focus on their underlying distributed algorithms. We present and characterize a new kernel benchmark suite (named IMSuite) that simulates some of the classical distributed algorithms in task parallel languages. We present multiple variations of our kernels, broadly categorized under two heads: (a) varying synchronization primitives (with and without fine grain synchronization primitives); and (b) varying forms of parallelization (data parallel and recursive task parallel). Our characterization covers interesting aspects of distributed applications such as distribution of remote communication requests, number of synchronization, task creation, task termination and atomic operations. We study the behavior (execution time) of our kernels by varying the problem size, the number of compute threads, and the input configurations. We also present an involved set of input generators and output validators.

“…We compare against MPI-only and hybrid MPI+OpenMP performance. We used a modified version of the EPCC Syncbench [113] for barrier and reduction (accumulator) tests.…”

Section: Results (mentioning)

confidence: 99%

Runtime Systems for Extreme Scale Platforms

Chatterjee

2013

Future extreme-scale systems are expected to contain homogeneous and heterogeneous many-core processors, with O(10^3) cores per node and O(10^6) nodes overall. Effective combination of inter-node and intra-node parallelism is recognized to be a major software challenge for such systems. Further, applications will have to deal with constrained energy budgets as well as frequent faults and failures. To help programmers manage these complexities and enhance programmability, much recent research has focused on designing state-of-the-art software runtime systems. Such runtime systems are expected to be a critical component of the software ecosystem for the management of parallelism, locality, load balancing, energy and resilience on extreme-scale systems. In this dissertation, we address three key challenges faced by a runtime system using a dynamic task parallel framework for extreme-scale computing. First, we address the challenge of integrating an intra-node task parallel runtime with a communication system for scalable performance. We present a runtime communication system, called HC-COMM, designed to use dedicated communication cores on a system. We introduce the HCMPI programming model which integrates the Habanero-C asynchronous dynamic task parallel language with the MPI message passing communication model on the HC-COMM runtime. We also introduce the HAPGNS model that enables data flow programming for extreme-scale systems in which the user does not require knowledge of MPI. Second, we address the challenge of separating locality optimizations from a programmer with domain specific knowledge. We present a tuning framework, through which performance experts can optimize existing applications by specifying runtime operations aimed at co-scheduling of affinitized tasks. Finally, we address the challenge of scalable synchronization for long running tasks on a dynamic task parallel runtime. We use the phaser construct to present a generalized tree-based synchronization algorithm and support unified collective operations at both inter-node and intra-node levels. Overcoming these runtime challenges is a first step towards effective programming on extreme-scale systems.
