Performance Analysis Techniques for Task-Based OpenMP Applications

Schmidl, Dirk; Philippen, Peter; Lorenz, Daniel; Rössel, Christian; Geimer, Markus; Mey, Dieter an; Mohr, Bernd; Wolf, Felix

doi:10.1007/978-3-642-30961-8_15

Cited by 19 publications

(16 citation statements)

References 11 publications


“…Schmidl et al [16] described possible performance problems with OpenMP tasks and visualized trace data of tasks with Vampir [7]. However, manually searching a time-line visualization for certain performance patterns is tedious and time consuming.…”

Section: Related Work (mentioning)

confidence: 98%


Profiling of OpenMP Tasks with Score-P

Lorenz

Philippen

Schmidl

et al. 2012

2012 41st International Conference on Parallel Processing Workshops

Self Cite


With the task construct, the OpenMP 3.0 specification introduces an additional level of parallelism that challenges established schemes of performance profiling. First, a thread may execute a sequence of interleaved task fragments that the profiling system must properly distinguish to enable correct performance analyses. Furthermore, the additional parallelization dimension requires new visualization methods for presenting analysis results. Finally, as a new programming paradigm, tasking implicitly introduces paradigm-specific performance issues and creates a need for corresponding optimization strategies. This paper presents solutions to overcome the challenges of profiling applications based on OpenMP tasks. In addition, the paper describes metrics that may help uncover performance problems related to tasking. We present an implementation of our solution within the Score-P performance measurement system, which we evaluate using the Barcelona OpenMP Task Suite.
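The first challenge named in the abstract, telling interleaved task fragments on one thread apart, can be illustrated with a minimal sketch. It is not the Score-P algorithm: the event format (a timestamp plus the id of the task the thread switches to) and the function name `time_per_task` are assumptions made only for this illustration.

```python
from collections import defaultdict


def time_per_task(switch_events, end_time):
    """Attribute one thread's execution time to task instances.

    switch_events: list of (timestamp, task_id) tuples sorted by timestamp.
    Each tuple marks the moment the thread starts or resumes task_id
    (hypothetical event format, not a Score-P data structure).
    end_time: timestamp at which the last fragment ends.
    """
    totals = defaultdict(float)
    # Pair every switch event with the next one; the gap between them is
    # one fragment of the task that was running in that interval.
    boundaries = switch_events[1:] + [(end_time, None)]
    for (start, task_id), (stop, _) in zip(switch_events, boundaries):
        totals[task_id] += stop - start
    return dict(totals)


# Task A is suspended at t=2.0, task B runs, then A resumes at t=3.0.
events = [(0.0, "A"), (2.0, "B"), (3.0, "A"), (5.0, "C")]
result = time_per_task(events, end_time=6.0)
print(result)
```

The two fragments of task A are summed under the same key, which is the property a naive per-thread call stack would lose when a task is suspended and later resumed.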



“…This comes at the cost of additional task management overhead. Schmidl et al [16] identified three performance issues specifically related to OpenMP tasks:…”

Section: Problem Analysis (mentioning)

confidence: 99%

Profiling of OpenMP Tasks with Score-P

Lorenz

Philippen

Schmidl

et al. 2012

2012 41st International Conference on Parallel Processing Workshops

Self Cite


“…In our fourth use case, we want to evaluate a task-based parallelization problem. One of the most common performance analysis targets is to identify tasks with inappropriate granularity [17]. For this purpose, we use an artificial program that has two task constructs.…”

Section: Task Granularity (mentioning)

confidence: 99%

Preventing the explosion of exascale profile data with smart thread-level aggregation

Lorenz

Shudler

Wolf

2015

Proceedings of the 4th Workshop on Extreme Scale Programming Tools

Self Cite


State-of-the-art performance analysis tools, such as Score-P, record performance profiles on a per-thread basis. However, for exascale systems the number of threads is expected to be on the order of a billion, which would result in extremely large performance profiles. Users almost never inspect the individual per-thread data. In this paper, we propose to aggregate per-thread performance data in each process to reduce its amount to a reasonable size. Our goal is to aggregate the threads such that thread-level performance issues are still visible and analyzable. Therefore, we implemented four aggregation strategies in Score-P: (i) SUM aggregates all threads of a process into a process profile; (ii) SET calculates statistical key data as well as the sum; (iii) KEY identifies three threads (i.e., key threads) of particular interest for performance analysis and aggregates the rest of the threads; (iv) CALLTREE clusters threads that have the same call-tree structure. For each of these strategies we evaluate the compression ratio and how well it maintains thread-level performance behavior information. The aggregation does not incur any additional performance overhead at application run-time.

General Terms: Algorithms, experimentation, measurement

Keywords: Performance analysis, data compression, exascale computing
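The SUM and SET strategies from the abstract can be illustrated with a minimal sketch for a single metric on a single call path. This is not the Score-P implementation: the function names are hypothetical, and the choice of minimum, maximum, and mean as the "statistical key data" is an assumption, since the abstract does not list the statistics.

```python
from statistics import mean


def aggregate_sum(per_thread_values):
    """SUM-style aggregation: collapse all threads into one process value."""
    return sum(per_thread_values)


def aggregate_set(per_thread_values):
    """SET-style aggregation: keep the sum plus a few summary statistics.

    Which statistics are kept is an assumption made for this sketch.
    """
    return {
        "sum": sum(per_thread_values),
        "min": min(per_thread_values),
        "max": max(per_thread_values),
        "mean": mean(per_thread_values),
    }


# Hypothetical time (in seconds) spent in one region by four threads.
per_thread = [4.0, 6.0, 5.0, 9.0]
total = aggregate_sum(per_thread)
summary = aggregate_set(per_thread)
print(total, summary)
```

With SUM alone, the imbalance caused by the fourth thread disappears into the total; the extra statistics in the SET-style result keep it visible at a fixed storage cost that does not grow with the thread count.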


“…Olivier et al [15] compared their scheduler in Qthreads with Intel's and GCC's OpenMP implementations concerning multi-socket SMPs. Schmidl et al [18] proposed a task-event model that helps profile performance on task-centric applications. Addison et al [16] created the OpenMP implementation in the compiler Open64 [17].…”

Section: Related Work (mentioning)

confidence: 99%

A comparative performance study of common and popular task‐centric programming frameworks

Podobas

Brorsson

Faxén

2013

Concurrency and Computation


Programmers today face a bewildering array of parallel programming models and tools, making it difficult to choose an appropriate one for each application. An increasingly popular programming model supporting structured parallel programming patterns in a portable and composable manner is the task-centric programming model. In this study, we compare several popular task-centric programming frameworks, including Cilk Plus, Threading Building Blocks, and various implementations of OpenMP 3.0. We have analyzed their performance on the Barcelona OpenMP Tasking Suite benchmark suite both on a 48-core AMD Opteron 6172 server and a 64-core TILEPro64 embedded many-core processor. Our results show that OpenMP offers the highest flexibility for programmers, and this flexibility comes at a cost. Frameworks supporting only a specific and more restrictive model, such as Cilk Plus and Threading Building Blocks, are generally more efficient both in terms of performance and energy consumption. However, Intel's implementation of OpenMP tasks performs the best and closest to the specialized run-time systems.

Footnotes: Mercurium is the source-to-source compiler used in conjunction with Nanos++. † Because of the (not entirely unexpected) lack of a TILEPro64 back-end in the Intel compiler. ‡ This means that to synchronize with N tasks, the programmer needs to explicitly use SYNC N times.

Figure 5. Memory footprint for each run-time system implementation when normalized against a serial execution. The base is the serial execution compiled with GCC.

[...] example, Fibonacci and N-queens. Intel's TBB is the implementation that has the largest memory footprint of all models.

4.4.6. Embedded power measurements. Power consumption and energy have risen to become metrics as important as performance, in particular on embedded devices. We have measured the power and energy consumed by the application under different run-time systems on the TILEPro64. We have not performed these measurements for the Opteron systems because we found it much more difficult to isolate the effect of the processors and memory on that machine, although it was relatively straightforward on the TILEPro64. We used a data acquisition device (NI USB-6210) to perform the power measurements on the TILEPro64. Sampling the power consumption did not interfere with the program execution in any way, as it was performed on a separate computer connected to the data acquisition device. This set-up is similar to the ones used by Själander et al. [44]. The measured sampling frequency was 20 kHz. We used a metric that we call speed-up power cost, which calculates the speed-up an application experiences for each added watt.

Experimental results and discussion. This section presents the experimental results obtained from micro-benchmarks and other benchmarks according to the methodology from the previous section.

Micro-benchmarks. All the micro-benchmark measurements were performed on the Opteron 6172 48-core system.

