Low-overhead call path profiling of unmodified, optimized code

Froyd, Nathan; Mellor-Crummey, John; Fowler, Rob

doi:10.1145/1088149.1088161

Cited by 77 publications

(69 citation statements)

References 18 publications


“…Complex function call transitions by setjmp/longjmp are known to break the orderly sequence of calls and returns as discussed in [20]. Therefore, several practical concerns must be addressed to keep the consistency of a loop stack within the call stack.…”

Section: To Track the Precise Loop Stack In A Real Program (mentioning)

confidence: 99%

Identifying Program Loop Nesting Structures during Execution of Machine Code

Sato

Inoguchi

Nakamura

2014

IEICE Trans. Inf. & Syst.


Summary: This paper presents a mechanism for detecting dynamic loop and procedure nesting during the actual program execution on-the-fly. This mechanism aims primarily at making better strategies for performance tuning or parallelization. Using a pre-compiled application executable machine code as an input, our mechanism statically generates simple but precise markers that indicate loop entries and loop exits, and dynamically monitors loop nesting that appears during the actual execution together with call context tree. To keep precise loop structures all the time, we monitor the indirect jumps that enter the loop regions and the setjmp/longjmp functions that cause irregular function call transfers. We also present a novel representation called Loop-Call Context Graph that can keep track of inter-procedural loop nests. We implement our mechanism and evaluate it using SPEC CPU2006 benchmark suite. The results confirm that our mechanism can successfully reveal the precise inter-procedural loop nest structures from all of SPEC CPU2006 benchmark executions without any particular compiler support. The results also show that it can reduce runtime loop detection overheads compared with the existing loop profiling method.

Key words: dynamic loop nests, loop-call context tree, on-the-fly loop detection
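The summary above mentions monitoring loop entry/exit markers together with calls, and treating setjmp/longjmp specially because they skip the normal sequence of returns. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the authors' mechanism: the class `ContextTracker`, its method names, and the string labels are all hypothetical.

```python
class ContextTracker:
    """Toy model of a combined loop/call stack (all names are hypothetical)."""

    def __init__(self):
        self.stack = []   # entries are ("call", name) or ("loop", loop_id)
        self.saved = {}   # jump-buffer label -> stack depth recorded at setjmp

    def enter_call(self, name):
        self.stack.append(("call", name))

    def exit_call(self):
        # A return may leave loops open in the callee; discard them first.
        while self.stack and self.stack[-1][0] == "loop":
            self.stack.pop()
        self.stack.pop()

    def enter_loop(self, loop_id):
        self.stack.append(("loop", loop_id))

    def exit_loop(self, loop_id):
        # Loop exits are expected to match the innermost open loop.
        assert self.stack[-1] == ("loop", loop_id)
        self.stack.pop()

    def setjmp(self, label):
        # Remember how deep the combined stack was at the setjmp site.
        self.saved[label] = len(self.stack)

    def longjmp(self, label):
        # Skip all intermediate returns and loop exits in one step.
        del self.stack[self.saved[label]:]


tracker = ContextTracker()
tracker.enter_call("main")
tracker.enter_loop("L1")
tracker.setjmp("buf")
tracker.enter_call("f")
tracker.enter_loop("L2")
tracker.enter_call("g")
tracker.longjmp("buf")
stack_after_longjmp = list(tracker.stack)
```

Recording only the stack depth at the setjmp site keeps the bookkeeping small; a real tool would also need to handle a jump buffer that outlives its frame, which this sketch ignores.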


“…HPCToolkit uses a lightweight trampoline, as shown in Figure 1(a) [9]. Instead of pointing return addresses to a stack, the topmost function of a prefix is instrumented to return into a trampoline function.…”

Section: A Fast Call-path Unwinding (mentioning)

confidence: 99%

“…This event causes control flow to skip the return that would install the trampoline in a lower frame. The trampoline approach thus must instrument all non-local exits to routines, which requires more complex code analysis [10]. Our approach avoids this analysis by simply instrumenting all return addresses.…”

Section: A Fast Call-path Unwinding (mentioning)

confidence: 99%

“…It then uses this information to map the profile data back to the user's source code and its dynamic execution path. In order to keep overhead low, HPCToolkit applies a trampoline-based prefix optimization [9], [11] similar to the thunk optimization presented in this paper. We discussed the advantages of our approach in Section II-A4.…”

Section: B Parameter Redistribution Bottleneck (mentioning)

confidence: 99%


Reconciling Sampling and Direct Instrumentation for Unintrusive Call-Path Profiling of MPI Programs

Szebenyi

Gamblin

Schulz

et al. 2011

2011 IEEE International Parallel & Distributed Processing Symposium


Abstract: We can profile the performance behavior of parallel programs at the level of individual call paths through sampling or direct instrumentation. While we can easily control measurement dilation by adjusting the sampling frequency, the statistical nature of sampling and the difficulty of accessing the parameters of sampled events make it unsuitable for obtaining certain communication metrics, such as the size of message payloads. Alternatively, direct instrumentation, which is preferable for capturing message-passing events, can excessively dilate measurements, particularly for C++ programs, which often have many short but frequently called class member functions. Thus, we combine these techniques in a unified framework that exploits the strengths of each approach while avoiding their weaknesses: We use direct instrumentation to intercept MPI routines while we record the execution of the remaining code through low-overhead sampling. One of the main technical hurdles mastered was the inexpensive and portable determination of call-path information during the invocation of MPI routines. We show that the overhead of our implementation is sufficiently low to support substantial performance improvement of a C++ fluid-dynamics code.
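The citation statements above describe a trampoline that marks a frame so that a later sample can reuse the unchanged call-path prefix, and note that non-local exits skip the return that would move the marker. The toy Python simulation below illustrates that reasoning only. It is a sketch under assumptions, not HPCToolkit's or the citing authors' implementation: the class `SimulatedStack`, its methods, and the frame names are hypothetical, and a list of names stands in for real stack frames.

```python
class SimulatedStack:
    """Toy call stack with one trampoline-style marker (hypothetical names)."""

    def __init__(self):
        self.frames = []         # outermost frame first
        self.marked_depth = 0    # frames[:marked_depth] unchanged since last sample
        self.cached_prefix = []  # call path recorded at the last sample
        self.frames_walked = 0   # total unwinding work, for comparison

    def call(self, name):
        self.frames.append(name)

    def ret(self):
        # Returning through the marked frame "fires the trampoline",
        # which moves the marker one frame down.
        self.frames.pop()
        if len(self.frames) < self.marked_depth:
            self.marked_depth = len(self.frames)

    def nonlocal_exit(self, depth):
        # A longjmp-style exit skips the returns above, so the marker
        # has to be repaired explicitly at the exit itself.
        del self.frames[depth:]
        self.marked_depth = min(self.marked_depth, depth)

    def sample(self):
        # Walk only the frames above the marker, then reuse the cached prefix.
        suffix = self.frames[self.marked_depth:]
        self.frames_walked += len(suffix)
        path = self.cached_prefix[:self.marked_depth] + suffix
        self.cached_prefix = list(path)
        self.marked_depth = len(self.frames)
        return path


stack = SimulatedStack()
for name in ("main", "a", "b"):
    stack.call(name)
first_path = stack.sample()    # walks all three frames
stack.ret()                    # "b" returns, marker moves down to "a"
stack.call("c")
second_path = stack.sample()   # walks only the new frame "c"
stack.call("d")
stack.nonlocal_exit(1)         # jump straight back into "main"
stack.call("e")
third_path = stack.sample()    # walks only the new frame "e"
```

If `nonlocal_exit` left `marked_depth` untouched, the third sample would reuse a stale prefix, which is the hazard the quoted passage attributes to uninstrumented non-local exits.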


“…This context contains the current path of the task through the distributed system. Then, statistical sampling is used to indirectly measure resource usage, and requires no intrusive OS-level instrumentation [20].…”

Section: Performance Profiles (mentioning)

confidence: 99%

Self-adapting Service Level in Java Enterprise Edition

et al. 2009


Abstract. Application servers are subject to varying workloads, which suggests an autonomic management to maintain optimal performance. We propose to integrate in the component-based programming model often used in current application servers the concept of service level adaptation, allowing some components to dynamically degrade or upgrade their level of service. Our goal is to be able, under heavy workloads, to trade a lower service level of the most resource-intensive components for a stable performance of the server as a whole. Upgrading or degrading components is autonomously performed through runtime profiling, which is used to estimate the application's hot spots and target adaptations. In addition to finding the best adaptations, this performance profile allows our system to characterize the effects of past adaptations; in particular given the current workload, it is possible to estimate if a service level upgrade might result in an overload. As a result, by stabilizing the server at peak performance via component adaptations, we are able to drastically improve both overall latency and throughput. For instance, on both the RUBiS and TPC-W benchmarks, we are able to maintain peak performance in heavy load scenarios, far exceeding the initial capacity of the system.
