In this work we present a microbenchmark methodology for assessing the overheads associated with nested parallelism in OpenMP. Our techniques are based on extensions to the well-known EPCC microbenchmark suite that allow measuring the overheads of OpenMP constructs when they are executed at inner levels of parallelism. The methodology is simple yet powerful, and it has enabled us to gain insight into problems related to implementing and supporting nested parallelism. We measure and compare a number of commercial and freeware compilation systems. Our general conclusion is that while nested parallelism is fortunately supported by many current implementations, the performance of this support is rather problematic. There appear to be issues that have not yet been addressed effectively, as most OpenMP systems do not react gracefully when made to execute inner levels of concurrency.
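The following is a minimal, hypothetical sketch of the kind of EPCC-style subtraction measurement the abstract describes, extended to an inner level of parallelism: the overhead of a nested parallel construct is estimated as the elapsed time of a test loop (which opens an inner parallel region on each iteration, from within an already-parallel outer region) minus a reference loop performing the same delay work. The names and constants (delay, INNERREPS, thread counts) are illustrative, not taken from the paper.

```c
#include <omp.h>
#include <stdio.h>

#define INNERREPS 1000
#define DELAY_LEN 500

static volatile double sink = 0.0;

/* Fixed amount of busy work, as in the EPCC delay routine. */
static void delay(int len) {
    double a = 0.0;
    for (int i = 0; i < len; i++) a += i * 0.5;
    sink = a;
}

int main(void) {
    omp_set_nested(1);                      /* enable nested parallelism */

    /* Reference time: the delay executed INNERREPS times sequentially. */
    double t0 = omp_get_wtime();
    for (int k = 0; k < INNERREPS; k++) delay(DELAY_LEN);
    double t_ref = omp_get_wtime() - t0;

    /* Test time: the same delay, but each repetition opens an inner
     * parallel region from inside an outer parallel region. */
    double t_test = 0.0;
    #pragma omp parallel num_threads(4)
    {
        double t1 = omp_get_wtime();
        for (int k = 0; k < INNERREPS; k++) {
            #pragma omp parallel num_threads(2)   /* nested (inner) level */
            { delay(DELAY_LEN); }
        }
        #pragma omp master
        t_test = omp_get_wtime() - t1;
    }

    printf("estimated nested parallel overhead per construct: %g us\n",
           1e6 * (t_test - t_ref) / INNERREPS);
    return 0;
}
```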
OpenMP can be supported in cluster environments by using distributed shared memory (DSM) systems. A portable approach to building such a DSM system is to layer it on MPI. With these goals in mind, this paper makes two contributions. The first is a discussion of two software DSM systems that we have implemented using MPI. One uses background polling threads, while the other uses processes that are driven only by incoming MPI messages. Comparisons of the two approaches show the latter to be a more scalable architecture that is better suited to the multi-core processors that are becoming commonplace. The second contribution recognizes that a common workaround for sub-team synchronization in OpenMP is to use the flush directive on shared variables within busy-wait loops. In such a situation, only the flush in the last iteration of the busy-wait loop produces the condition necessary for exiting the loop. Thus the shared value need only be transferred when it has changed. We implement in our DSM a flush mechanism that eliminates these unnecessary data transfers entirely, without any additional support or hints from the programmer.
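For reference, this is a minimal sketch of the busy-wait/flush idiom the abstract refers to (the variable names and thread roles are assumptions for illustration, not the paper's code): a consumer thread spins on a shared flag, flushing it on every iteration, until a producer sets it; in a DSM, only the iteration in which the flag has actually changed needs to move data.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    int flag = 0;       /* shared flag used for point-to-point synchronization */
    int payload = 0;    /* data made visible to the consumer once flag is set */

    #pragma omp parallel num_threads(2) shared(flag, payload)
    {
        int tid = omp_get_thread_num();

        if (tid == 0) {                     /* producer */
            payload = 42;
            #pragma omp flush(payload)
            flag = 1;
            #pragma omp flush(flag)
        } else {                            /* consumer: busy-wait on flag */
            int done = 0;
            while (!done) {
                #pragma omp flush(flag)     /* only the last flush sees flag == 1 */
                done = flag;
            }
            #pragma omp flush(payload)
            printf("thread %d saw payload = %d\n", tid, payload);
        }
    }
    return 0;
}
```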
The OpenMP shared memory programming paradigm has been widely embraced by the computational science community, as have distributed-memory clusters. What are the prospects for running OpenMP applications on clusters? This paper gives an overview of the SCore cluster-enabled OpenMP environment, provides performance data for some of the fundamental underlying operations, and reports overall performance for a model computational science application (the finite difference solution of the 2D Laplace equation).
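As an illustration of the model application mentioned above (not the paper's own code), a Jacobi finite-difference solver for the 2D Laplace equation can be written with the grid sweep parallelized by an OpenMP worksharing loop; the grid size, boundary condition, and tolerance below are assumptions for the sketch.

```c
#include <math.h>
#include <stdio.h>

#define N 256
static double u[N][N], unew[N][N];

int main(void) {
    /* Boundary condition: top edge held at 1.0, everything else starts at 0. */
    for (int j = 0; j < N; j++) u[0][j] = unew[0][j] = 1.0;

    double diff;
    do {
        diff = 0.0;
        /* One Jacobi sweep: each interior point becomes the average of its
         * four neighbours; the reduction tracks the largest change. */
        #pragma omp parallel for reduction(max:diff)
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++) {
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
                double d = fabs(unew[i][j] - u[i][j]);
                if (d > diff) diff = d;
            }

        /* Copy the new iterate back into u for the next sweep. */
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                u[i][j] = unew[i][j];
    } while (diff > 1e-4);

    printf("converged, centre value %g\n", u[N/2][N/2]);
    return 0;
}
```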