Performance Monitoring on the POWER5™ Microprocessor

Mericas, Alex E.

doi:10.1201/9781420037425.ch12

Cited by 15 publications

(17 citation statements)

References 3 publications


“…Collecting CPI stacks on out-of-order cores is more complicated because of various overlap effects between miss events, e.g., a long-latency load may hide the latency of another independent long-latency load miss or mispredicted branch, etc. Recent commercial processors such as IBM Power5 [23] and Intel Sandy Bridge [12] however provide support for computing memory stall components. PIE scheduling also requires the number of LLC misses and the number of dynamically executed instructions, which can be measured using existing hardware performance counters.…”

Section: Hardware Support (mentioning)

confidence: 99%

Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE)

Craeynest, Jaleel, Eeckhout, et al. 2012

SIGARCH Comput. Archit. News



“…For cores with multiple commit width, at each cycle, multiple counters can increase, each corresponding to a retirement slot. The mechanism described here is similar to the performance monitors in IBM POWER5 [32]; with the following extensions: depending on how the cache miss is served, the dCache is incremented differently. Details will be discussed in Section 4.1.1.…”

Section: Performance Profile With Hardware Performance Monitors (mentioning)

confidence: 99%

“…Obtaining accurate execution time breakdowns in an out-of-order processor core is difficult due to the overlap of multiple on-the-fly instructions. Examining the instructions at the head of ROB gives us some clues [32] to the cause of a stall. In this section, we show how to obtain such execution time breakdowns for TLS execution.…”

Section: Performance Profile With Hardware Performance Monitors (mentioning)

confidence: 99%
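
The stall-attribution idea in the statement above (charge each cycle to whatever is blocking the instruction at the head of the reorder buffer) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the POWER5 mechanism or the citing paper's implementation: the per-cycle trace format, the category names, and the function `cpi_stack` are all hypothetical.

```python
from collections import Counter


def cpi_stack(trace, instructions):
    """Build a toy CPI stack from a per-cycle trace.

    trace: one entry per cycle. None means at least one instruction
           retired that cycle; otherwise a string naming what blocked
           the instruction at the head of the ROB (hypothetical labels).
    instructions: number of instructions retired over the whole trace.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    cycles = Counter()
    for blocker in trace:
        # Cycles with a retirement count as "base"; every other cycle
        # is charged to the single cause stalling the ROB head.
        cycles["base" if blocker is None else blocker] += 1
    # Dividing each bucket by the instruction count yields CPI components.
    return {cause: n / instructions for cause, n in cycles.items()}


# Hypothetical 8-cycle trace in which 4 instructions retire.
trace = [None, None, "dcache_miss", "dcache_miss", "dcache_miss",
         None, "branch_mispredict", None]
stack = cpi_stack(trace, instructions=4)
```

Because every cycle lands in exactly one bucket, the components add up to the overall CPI (here 8 cycles over 4 instructions). Overlapping miss events are not split between causes in this simplification, which is the difficulty the citation statements point out for out-of-order cores.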

“…On the other hand, Lu et al. [25,26] generate helper thread prefetches using information obtained from the hardware monitors on the Sun UltraSPARC®. The optimization framework proposed in ADORE [25,26] is similar to the speculative thread optimization framework proposed in this paper, with the following differences: (i) our work uses hardware-based performance counters that generate cycle breakdowns [9,32], while ADORE uses event-based hardware performance counters; (ii) in ADORE, a dynamic compiler is responsible for generating and patching re-optimized code at runtime, while our scheme does not require dynamic code generation; and (iii) we carefully evaluate the performance impact of speculation threads before optimization, while ADORE does not evaluate the effectiveness of the prefetching threads.…”

Section: Related Work (mentioning)

confidence: 99%

“…The profile can be obtained through either software instrumentation or hardware performance monitor sampling. In this paper, hardware-based performance counters [9,32] are used. A small piece of code that initializes these counters is executed at the beginning of execution by modifying a libc entry-point routine named __libc_start_main.…”

Section: Introduction (mentioning)

confidence: 99%


Dynamic performance tuning for speculative threads

Luo, Packirisamy, Hsu, et al. 2009

Proceedings of the 36th Annual International Symposium on Computer Architecture


In response to the emergence of multicore processors, various novel and sophisticated execution models have been introduced to fully utilize these processors. One such execution model is Thread-Level Speculation (TLS), which allows potentially dependent threads to execute speculatively in parallel. While TLS offers significant performance potential for applications that are otherwise non-parallel, extracting efficient speculative threads in the presence of complex control flow and ambiguous data dependences is a real challenge. This task is further complicated by the fact that the performance of speculative threads is often architecture-dependent and input-sensitive, and exhibits phase behaviors. Thus we propose dynamic performance tuning mechanisms that determine where and how to create speculative threads at runtime. This paper describes the design, implementation, and evaluation of hardware and software support that takes advantage of runtime performance profiles to extract efficient speculative threads. In our proposed framework, speculative threads are monitored by hardware-based performance counters and their performance impact is estimated. The creation of speculative threads is adjusted based on the estimation. This paper proposes speculative-thread performance estimation techniques that are capable of correctly determining whether speculation can improve performance for loops that correspond to 83.8% of total loop execution time across all benchmarks. This paper also examines several dynamic performance tuning policies and finds that the best tuning policy achieves an overall speedup of 36.8% on a set of benchmarks from the SPEC2000 suite, which outperforms static thread management by 9.5%.


Stable Matching Scheduler for Single-ISA Heterogeneous Multi-core Processors

Wang, Liu, et al. 2015

Lecture Notes in Computer Science
