Profiling of OpenMP Tasks with Score-P

Lorenz, Daniel; Philippen, Peter; Schmidl, Dirk; Wolf, Felix

doi:10.1109/icppw.2012.62

Cited by 14 publications

(5 citation statements)

References 15 publications


“…In some of the benchmarked applications, we observed that the replay time for p = 1 is slightly bigger than the execution time of the original code. This happens due to small perturbation effects of task instrumentation [25]; the impact of this effect, however, is minimal. Table 2 presents the models for T∞(n) and π(n) (average parallelism) that were created using the results from the TDG analysis.…”

Section: Analysis Of the Results (mentioning)

confidence: 99%

Isoefficiency in Practice

et al. 2017

Self Cite


Task-based programming offers an elegant way to express units of computation and the dependencies among them, making it easier to distribute the computational load evenly across multiple cores. However, this separation of problem decomposition and parallelism requires a sufficiently large input problem to achieve satisfactory efficiency on a given number of cores. Unfortunately, finding a good match between input size and core count usually requires significant experimentation, which is expensive and sometimes even impractical. In this paper, we propose an automated empirical method for finding the isoefficiency function of a task-based program, binding efficiency, core count, and the input size in one analytical expression. This allows the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we not only find (i) the actual isoefficiency function but also (ii) the function one would obtain if the program execution were free of resource contention and (iii) an upper bound that could only be reached if the program were able to maintain its average parallelism throughout its execution. The difference between the three helps to explain low efficiency, and in particular, it helps to differentiate between resource contention and structural conflicts related to task dependencies or scheduling. The insights gained can be used to co-design programs and shared system resources.
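To make the relationship between efficiency, core count, and input size more concrete, the following is a minimal Python sketch of the upper bound described in (iii). It is not the paper's method. It assumes the standard work/span bound, under which speedup on p cores cannot exceed min(p, π(n)), so efficiency is at most min(1, π(n)/p). The function names, the search limit `n_max`, and the example model π(n) = √n are all hypothetical and chosen only for illustration.

```python
import math


def efficiency_upper_bound(p, avg_parallelism):
    """Upper bound on efficiency for p cores, assuming speedup <= min(p, pi)."""
    return min(1.0, avg_parallelism / p)


def min_input_size(p, target_eff, parallelism_model, n_max=1 << 40):
    """Smallest integer n whose efficiency bound reaches target_eff.

    parallelism_model is a hypothetical callable n -> pi(n) and is
    assumed to be non-decreasing in n. Returns None if no n up to
    roughly n_max satisfies the target.
    """
    lo, hi = 1, 1
    # Doubling phase: find some n that satisfies the target.
    while efficiency_upper_bound(p, parallelism_model(hi)) < target_eff:
        if hi >= n_max:
            return None
        lo = hi
        hi *= 2
    # Binary search for the smallest satisfying n in [lo, hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if efficiency_upper_bound(p, parallelism_model(mid)) >= target_eff:
            hi = mid
        else:
            lo = mid + 1
    return lo


if __name__ == "__main__":
    # Hypothetical model: average parallelism grows like sqrt(n).
    n = min_input_size(p=16, target_eff=0.5, parallelism_model=math.sqrt)
    print("smallest n for 50% efficiency bound on 16 cores:", n)
```

Under these assumptions, fixing the target efficiency and varying p traces out how fast the input size must grow with the core count, which is the question an isoefficiency function answers.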


“…The former shows stub nodes at execution locations along the main call tree, and the latter shows a subtree for every task construct. The Score-P task profiling mechanism and resulting profile data is explained in [13] in more detail.…”

Section: Task Granularity (mentioning)

confidence: 99%

“…On the other hand, we spend 607s creating tasks. Former task analysis examples showed that task switches and task completion can require roughly the same amount of execution time which will appear as exclusive execution time in the barrier [13]. In principle, task dependency structures may limit parallelism.…”

Section: Task Granularity (mentioning)

confidence: 99%


Preventing the explosion of exascale profile data with smart thread-level aggregation

Lorenz, Shudler, Wolf 2015

Proceedings of the 4th Workshop on Extreme Scale Programming Tools

Self Cite


State-of-the-art performance analysis tools, such as Score-P, record performance profiles on a per-thread basis. However, exascale systems are expected to run on the order of a billion threads, which would result in extremely large performance profiles. In most cases, the user does not inspect the individual per-thread data. In this paper, we propose to aggregate per-thread performance data in each process to reduce it to a reasonable size. Our goal is to aggregate the threads such that thread-level performance issues remain visible and analyzable. To this end, we implemented four aggregation strategies in Score-P: (i) SUM: aggregates all threads of a process into a process profile; (ii) SET: calculates statistical key data as well as the sum; (iii) KEY: identifies three threads (i.e., key threads) of particular interest for performance analysis and aggregates the rest of the threads; (iv) CALLTREE: clusters threads that have the same call-tree structure. For each of these strategies, we evaluate the compression ratio and how well it preserves thread-level performance behavior. The aggregation does not incur any additional performance overhead at application run-time. General Terms: Algorithms, experimentation, measurement. Keywords: Performance analysis, data compression, exascale computing.
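As a rough illustration of the first two strategies named in the abstract, the sketch below aggregates per-thread profiles in plain Python. It is not Score-P code and does not use any Score-P API. The profile representation (a dict from call path to time) and the function names are hypothetical, and the statistics chosen for SET (min, max, mean, count) are an assumption, since the abstract only says "statistical key data as well as the sum".

```python
from collections import defaultdict


def aggregate_sum(thread_profiles):
    """SUM-style aggregation: add up each call path's value over all threads."""
    total = defaultdict(float)
    for profile in thread_profiles:
        for path, value in profile.items():
            total[path] += value
    return dict(total)


def aggregate_set(thread_profiles):
    """SET-style aggregation: keep the sum plus a few summary statistics.

    The statistics kept here (min, max, mean, count) are assumed for
    illustration only.
    """
    samples = defaultdict(list)
    for profile in thread_profiles:
        for path, value in profile.items():
            samples[path].append(value)
    return {
        path: {
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
            "count": len(values),
        }
        for path, values in samples.items()
    }


if __name__ == "__main__":
    # Hypothetical per-thread profiles: call path -> time in seconds.
    profiles = [
        {"main": 2.0, "main/work": 1.5},
        {"main": 2.0, "main/work": 0.5},
        {"main": 2.0},
    ]
    print(aggregate_sum(profiles))
    print(aggregate_set(profiles)["main/work"])
```

In this sketch, SUM discards all thread-level variation, while SET retains enough to spot imbalance, for example a large gap between the minimum and maximum time spent in one call path.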


“…With respect to future trace analysis enhancements, we plan to extend the current OpenMP analysis of Scalasca with the analysis of OpenMP tasks. Score-P can already record task events [12,9]. However, we must extend Scalasca's profile construction algorithm and we want to add some task specific patterns to its analysis.…”

Section: Future Work (mentioning)

confidence: 99%

Extending Scalasca’s Analysis Features

Lorenz, Böhme, Mohr, et al. 2013

Tools for High Performance Computing 2012

Self Cite


Scalasca is a performance analysis tool that parses the trace of an application run for patterns that indicate performance inefficiencies. In this paper, we present recently developed features in Scalasca. In particular, we describe two newly implemented analysis methods: the root cause analysis, which tries to identify the cause of a delay, and the critical path analysis, which analyzes the path of execution that determines the application runtime. Furthermore, we present time-series profiling, a method for exploring the time-dependent behavior of an application. Finally, we extended Scalasca and its output format CUBE to define and display topologies.

