2012 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS)
DOI: 10.1109/ispass.2012.6189221
Speedup stacks: Identifying scaling bottlenecks in multi-threaded applications

Abstract: Multi-threaded workloads typically show sublinear speedup on multi-core hardware, i.e., the achieved speedup is not proportional to the number of cores and threads. Sublinear scaling may have multiple causes, such as poorly scalable synchronization leading to spinning and/or yielding, and interference in shared resources such as the last-level cache (LLC) as well as the main memory subsystem. It is vital for programmers and processor designers to understand scaling bottlenecks in existing and emerging workloads…

Cited by 44 publications (18 citation statements). References 18 publications.
“…Some recent work in performance visualization has focused on capturing and visualizing gross performance scalability trends in multi-threaded applications running on multicore hardware, but does not guide the programmer on where to focus optimization. Speedup stacks [8] present an analysis of the causes of why an application does not achieve perfect scalability, comparing the achieved speedup of a multi-threaded program against the ideal speedup. Speedup stacks measure the impact of synchronization and of interference in shared hardware resources, and attribute the gap between achieved and ideal speedup to the different possible performance delimiters.…”
Section: Performance Visualization
confidence: 99%
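As a rough illustration of the idea described above — attributing the gap between achieved and ideal speedup to individual overhead components — the following toy sketch decomposes a hypothetical run. All cycle counts and component names here are invented for illustration; the paper derives its components from hardware cycle accounting, not from this simplified model.

```python
# Toy speedup-stack sketch: attribute the gap between ideal and achieved
# speedup to per-component cycle overheads. Numbers are hypothetical.

def speedup_stack(base_cycles, parallel_cycles, overhead_cycles, n_cores):
    """base_cycles: single-threaded execution time in cycles.
    parallel_cycles: useful-work cycles in the n-core run.
    overhead_cycles: dict mapping overhead name -> cycles lost to it.
    Returns achieved speedup, ideal speedup, and per-component speedup loss."""
    total = parallel_cycles + sum(overhead_cycles.values())
    achieved = base_cycles / total
    ideal = n_cores
    losses = {}
    for name, cyc in overhead_cycles.items():
        # Speedup lost = speedup if only this overhead were removed,
        # minus the achieved speedup.
        without = base_cycles / (total - cyc)
        losses[name] = without - achieved
    return achieved, ideal, losses

achieved, ideal, losses = speedup_stack(
    base_cycles=16_000,
    parallel_cycles=1_000,  # a perfect 16x run would take 1,000 cycles
    overhead_cycles={"spinning": 400, "LLC interference": 250, "memory": 150},
    n_cores=16,
)
print(f"achieved {achieved:.2f}x of ideal {ideal}x")
for name, loss in losses.items():
    print(f"  {name}: costs {loss:.2f}x speedup")
```

Stacking the achieved speedup plus each component's loss visualizes how far the run is from the ideal, which is the intuition behind the stacked-bar presentation.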
“…We exclude the numbers for avrora and pseudoJBB for Figures 5, 7, 8, 9, 15, and 16. For these benchmarks, it is impossible to vary the number of application threads independently from the problem size.…”
confidence: 99%
“…According to recent studies [3,5], several factors hinder shared-memory parallel programs from scaling perfectly: contention for shared resources such as the last-level cache (LLC) and memory bandwidth, synchronization stalls including spinning and yielding, and workload imbalance and parallelization overhead. Eyerman et al. [5] quantify the impact of these scaling delimiters and show that synchronization is the most important component for most of the benchmarks, especially the poorly scaling ones.…”
Section: Motivational Data
confidence: 99%
“…Furthermore, identifying the serial part and the parallel part is a great challenge. Eyerman et al. [55,56] proposed an offline analysis tool to examine Cycles Per Instruction (CPI) breakdowns for parallel applications, and then use the results to estimate threads' demands during execution. To work in a multithreaded environment, a thread was sampled first while running alone and then while running together with other threads, so that a better scheduling could be derived [57].…”
Section: Heterogeneous Microprocessors
confidence: 99%
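The CPI breakdown mentioned above can be sketched in a few lines: total cycles equal instructions times CPI, and CPI is split into a base component plus per-cause stall components. The component names and numbers below are illustrative assumptions, not values from the cited tool.

```python
# Sketch of a CPI (cycles-per-instruction) breakdown in the spirit of
# cycle accounting: total CPI = base CPI + stall CPI per cause.
# All component names and values are hypothetical.

instructions = 2_000_000
cpi_components = {
    "base": 0.8,            # ideal-pipeline CPI with no stalls
    "LLC misses": 0.5,      # stall cycles per instruction due to cache misses
    "synchronization": 0.3, # stall cycles per instruction due to sync waits
}
total_cpi = sum(cpi_components.values())
total_cycles = instructions * total_cpi
print(f"total CPI = {total_cpi:.2f}, cycles = {total_cycles:,.0f}")
for name, cpi in cpi_components.items():
    print(f"  {name}: {cpi / total_cpi:.0%} of cycles")
```

Comparing such breakdowns across thread counts shows which stall component grows with parallelism, which is what makes the offline analysis useful for estimating a thread's demands.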