Extended Abstract

With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors, parallel programming techniques have evolved that support parallelism beyond a single level. When comparing the performance of applications based on different programming paradigms, it is important to differentiate between the influence of the programming model itself and other factors, such as implementation specific behavior of the operating system (OS) or architectural issues. Rewriting a large scientific application in order to employ a new programming paradigm is usually a time consuming and error prone task. Before embarking on such an endeavor it is important to determine that there is really a gain that would not be possible with the current implementation. A detailed performance analysis is crucial to clarify these issues.

The multilevel programming paradigms considered in this study are hybrid MPI/OpenMP, MLP, and nested OpenMP. The hybrid MPI/OpenMP approach is based on using MPI [7] for the coarse grained parallelization and OpenMP [9] for fine grained loop level parallelism. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed memory architectures but is also suitable for shared memory systems.

The second paradigm under consideration is MLP, which was developed by Taft [11]. The approach is similar to MPI/OpenMP, using a mix of coarse grain process level parallelization and loop level OpenMP parallelization. As is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by reading from and writing to the shared memory.
Libraries supporting the MLP paradigm usually provide routines for process creation, shared memory allocation, and