2006
DOI: 10.1145/1168918.1168880

A performance counter architecture for computing accurate CPI components

Abstract: A common way of representing processor performance is to use Cycles per Instruction (CPI) 'stacks' which break performance into a baseline CPI plus a number of individual miss event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software application developers and computer architects. However, computing CPI stacks on superscalar out-of-order processors is challenging because of various overla…
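To make the idea of a CPI stack concrete, here is a minimal sketch in Python. It assumes each cycle has already been attributed to exactly one component, which is the naive additive view; the abstract notes that doing this attribution correctly on out-of-order processors is the hard part. The function name `cpi_stack`, the component names, and all the numbers are hypothetical and are not taken from the paper.

```python
def cpi_stack(cycles_by_component, instructions):
    """Return (per-component CPI dict, total CPI).

    cycles_by_component: cycles attributed to each component (hypothetical
    attribution; assumes no cycle is counted under two components).
    instructions: number of retired instructions.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    stack = {name: cycles / instructions
             for name, cycles in cycles_by_component.items()}
    return stack, sum(stack.values())


# Illustrative numbers only: a baseline plus three miss-event components.
cycles = {
    "base": 500_000,
    "icache_miss": 100_000,
    "dcache_miss": 300_000,
    "branch_mispredict": 100_000,
}
stack, total = cpi_stack(cycles, instructions=1_000_000)
for name, cpi in stack.items():
    print(f"{name:18s} {cpi:.2f}")
print(f"{'total':18s} {total:.2f}")
```

Because the components are disjoint by assumption, they sum to the overall CPI; when miss events overlap, a real counter architecture has to decide which component each stall cycle belongs to.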

Cited by 58 publications (90 citation statements)
References 16 publications
“…Power5's 2-way SMT feature does a relatively good job of keeping the pipeline busy, however instruction cache misses still account for a significant fraction of stall cycles (12%) for both of the WebSphere J2EE applications. It has been shown that the Power5 counter mechanism actually underestimates the performance penalty of icache misses [13]. Consequently, we consider this estimate a conservative lower bound.…”
Section: Stall Cycles (mentioning)
confidence: 99%
“…For example, a load instruction can miss in the data cache a few cycles after a branch is mispredicted. However, it has been observed (and we confirmed) that overlapping between different types of miss-events is rare enough that ignoring it results in negligible error in typical applications [19], [12]. This paper focuses on improving the accuracy of the modeled CPI_D$miss (i.e., CPI component due to long latency data cache misses) since it is the component with the largest error in prior first-order models [18], [19].…”
Section: Background: First-order Model (mentioning)
confidence: 64%
“…We compare against a cycle accurate simulator rather than real hardware to validate our models since a simulator provides insights that would be challenging to obtain without changes to currently deployed superscalar performance counter hardware [12]. We believe the most important factor is comparing two or more competing (hybrid) analytical models against a single detailed simulator provided the latter captures the behavior one wishes to model analytically.…”
Section: Methods (mentioning)
confidence: 99%
“…A recent paper [21] proposes a new cycle accounting architecture for SMT processors based on estimating the CPI stack of each running task [22]. This proposal tracks fifteen different components of the CPI stack with a dedicated hardware.…”
Section: Related Work (mentioning)
confidence: 99%