2012 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS)
DOI: 10.1109/ispass.2012.6189221
Speedup stacks: Identifying scaling bottlenecks in multi-threaded applications

Abstract: Multi-threaded workloads typically show sublinear speedup on multi-core hardware, i.e., the achieved speedup is not proportional to the number of cores and threads. Sublinear scaling may have multiple causes, such as poorly scalable synchronization leading to spinning and/or yielding, and interference in shared resources such as the last-level cache (LLC) as well as the main memory subsystem. It is vital for programmers and processor designers to understand scaling bottlenecks in existing and emerging workloads…

Cited by 44 publications (18 citation statements). References 18 publications.
“…Some recent work in performance visualization has focused on capturing and visualizing gross performance scalability trends in multi-threaded applications running on multicore hardware, but does not guide the programmer on where to focus optimization. Speedup stacks [8] present an analysis of the causes of why an application does not achieve perfect scalability, comparing the achieved speedup of a multi-threaded program against the ideal speedup. Speedup stacks measure the impact of synchronization and of interference in shared hardware resources, and attribute the gap between achieved and ideal speedup to the different possible performance delimiters.…”
Section: Performance Visualization
confidence: 99%
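As a rough illustration of the idea described above — attributing the gap between achieved and ideal speedup to individual overhead components — the following toy sketch decomposes a hypothetical run. All cycle counts and component names here are invented for illustration; the paper derives its components from hardware cycle accounting, not from this simplified model.

```python
# Toy speedup-stack sketch: attribute the gap between ideal and achieved
# speedup to per-component cycle overheads. Numbers are hypothetical.

def speedup_stack(base_cycles, parallel_cycles, overhead_cycles, n_cores):
    """base_cycles: single-threaded execution time in cycles.
    parallel_cycles: useful-work cycles in the n-core run.
    overhead_cycles: dict mapping overhead name -> cycles lost to it.
    Returns achieved speedup, ideal speedup, and per-component speedup loss."""
    total = parallel_cycles + sum(overhead_cycles.values())
    achieved = base_cycles / total
    ideal = n_cores
    losses = {}
    for name, cyc in overhead_cycles.items():
        # Speedup lost = speedup if only this overhead were removed,
        # minus the achieved speedup.
        without = base_cycles / (total - cyc)
        losses[name] = without - achieved
    return achieved, ideal, losses

achieved, ideal, losses = speedup_stack(
    base_cycles=16_000,
    parallel_cycles=1_000,  # a perfect 16x run would take 1,000 cycles
    overhead_cycles={"spinning": 400, "LLC interference": 250, "memory": 150},
    n_cores=16,
)
print(f"achieved {achieved:.2f}x of ideal {ideal}x")
for name, loss in losses.items():
    print(f"  {name}: costs {loss:.2f}x speedup")
```

Stacking the achieved speedup plus each component's loss visualizes how far the run is from the ideal, which is the intuition behind the stacked-bar presentation.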
“…We exclude the numbers for avrora and pseudoJBB for Figures 5, 7, 8, 9, 15, and 16. For these benchmarks, it is impossible to vary the number of application threads independently from the problem size.…”
confidence: 99%
“…According to recent studies [3,5], several factors hinder shared-memory parallel programs from scaling perfectly: contention for shared resources such as the last-level cache (LLC) and memory bandwidth, synchronization stalls including spinning and yielding, and workload imbalance and parallelization overhead. Eyerman et al. [5] quantify the impact of these scaling delimiters and show that synchronization is the most important component for most of the benchmarks, especially the poorly scaling ones.…”
Section: Motivational Data
confidence: 99%
“…Furthermore, identifying the serial part and the parallel part is a great challenge. Eyerman et al. [55,56] proposed an offline analysis tool to examine Cycles Per Instruction (CPI) breakdowns for parallel applications, and then use the results to estimate threads' demands during execution. To work in a multithreaded environment, a thread was sampled first while running alone and then while running together with other threads, so that a better scheduling could be derived [57].…”
Section: Heterogeneous Microprocessors
confidence: 99%
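The CPI breakdown mentioned above can be sketched in a few lines: total cycles equal instructions times CPI, and CPI is split into a base component plus per-cause stall components. The component names and numbers below are illustrative assumptions, not values from the cited tool.

```python
# Sketch of a CPI (cycles-per-instruction) breakdown in the spirit of
# cycle accounting: total CPI = base CPI + stall CPI per cause.
# All component names and values are hypothetical.

instructions = 2_000_000
cpi_components = {
    "base": 0.8,            # ideal-pipeline CPI with no stalls
    "LLC misses": 0.5,      # stall cycles per instruction due to cache misses
    "synchronization": 0.3, # stall cycles per instruction due to sync waits
}
total_cpi = sum(cpi_components.values())
total_cycles = instructions * total_cpi
print(f"total CPI = {total_cpi:.2f}, cycles = {total_cycles:,.0f}")
for name, cpi in cpi_components.items():
    print(f"  {name}: {cpi / total_cpi:.0%} of cycles")
```

Comparing such breakdowns across thread counts shows which stall component grows with parallelism, which is what makes the offline analysis useful for estimating a thread's demands.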