StatStack: Efficient modeling of LRU caches

Eklov, David; Hägersten, Erik

doi:10.1109/ispass.2010.5452069

Cited by 112 publications

(74 citation statements)

References 29 publications

“…This information is provided as input to the runtime and can be generated with help of a profiling pass. There is a plethora of prior works that use various fast profiling methods to identify which memory instructions miss in the cache hierarchy, such as [1,5,13,23]. Those memory instructions can be targeted for software prefetching as shown by [8,13,23].…”

Section: Inserting Software Prefetches (mentioning)

confidence: 99%

“…However, in this work we have focused on reuse-distance based methods to model prefetches as they can be targeted to enable improved use of shared resources. Reuse-distance based models such as [1,5] can model miss ratios for individual memory instructions. This information can be used to decide which memory instructions should be targeted for software prefetching.…”

Section: Inserting Software Prefetches (mentioning)

confidence: 99%

AREP: Adaptive Resource Efficient Prefetching for Maximizing Multicore Performance

Khan, Laurenzano, Mars, et al. 2015

2015 International Conference on Parallel Architecture and Compilation (PACT)

Self Cite

Abstract: Modern processors widely use hardware prefetching to hide memory latency. While aggressive hardware prefetchers can improve performance significantly for some applications, they can limit the overall performance in highly utilized multicore processors by saturating the off-chip bandwidth and wasting last-level cache capacity. Co-executing applications can slow down due to contention over these shared resources. This work introduces Adaptive Resource Efficient Prefetching (AREP), a runtime framework that dynamically combines software prefetching and hardware prefetching to maximize throughput in highly utilized multicore processors. AREP achieves better performance by prefetching data in a resource-efficient way, conserving off-chip bandwidth and last-level cache capacity with accurate prefetching and by applying cache bypassing when possible. AREP dynamically explores a mix of hardware/software prefetching policies, then selects and applies the best-performing policy. AREP is phase-aware and re-explores (at runtime) for the best prefetching policy at phase boundaries. A multitude of experiments with workload mixes and parallel applications on a modern high-performance multicore show that AREP can increase throughput by up to 49% (8.1% on average). This is complemented by improved fairness, resulting in average quality of service above 94%.

“…These information can be automatically generated or retrieved from a real application. The number of instructions, the memory access rate and the stack distance profile can be generated using tools such as an extension to CacheGrind (Babka et al, 2012), StatStack (Eklov and Hagersten, 2010) or MICA 2 (Hoste and Eeckhout, 2007). The base CPI requires a cycle accurate simulator.…”

Section: Memory Behavior Of a Task (mentioning)

confidence: 99%

Simulation of Real-time Multiprocessor Scheduling with Overheads

Chéramy, Déplanche, Hladik 2013

Proceedings of the 3rd International Conference on Simulation and Modeling Methodologies, Technologies and Applications

Abstract: Numerous scheduling algorithms were and still are designed in order to handle multiprocessor architectures, raising new issues due to the complexity of such architectures. Moreover, evaluating them is difficult without a real and complex implementation. Thus, this paper presents a tool that intends to facilitate the study of schedulers by providing an easy way of prototyping. Compared to the other scheduling simulators, this tool takes into account the impact of the caches through statistical models and includes direct overheads such as context switches and scheduling decisions.

“…MRCs capture an application's cache miss ratio as a function of the cache space available to the applications. MRCs can be generated fairly cheaply [9,6,8], and have been used in contexts such as cache partitioning [14], off-chip bandwidth partitioning [11] and cache contention modeling [7]. However, while MRCs provide significant insight into the miss ratios and data locality of applications, they are limited in their ability to predict performance.…”

Section: Miss Ratio Curves (mentioning)

confidence: 99%

“…As long as the the sampling covers all phases of the application fairly, this would allow accurate data collection with a further reduction of the overhead. Such approaches have been used to speed up simulation [16] and stack distance collection [6]. However, we have not implemented this approach.…”

Section: Dynamically Varying the Pirate Size (mentioning)

confidence: 99%

Cache Pirating: Measuring the Curse of the Shared Cache

Eklov, Nikoleris, Black-Schaffer, et al. 2011

2011 International Conference on Parallel Processing

Self Cite

We present a low-overhead method for accurately measuring application performance (CPI) and off-chip bandwidth (GB/s) as a function of its available shared cache capacity, on real hardware, with no modifications to the application or operating system. We accomplish this by co-running a Pirate application that "steals" cache space with the Target application. By adjusting how much space the Pirate steals during the Target's execution, and using hardware performance counters to record the Target's performance, we can accurately and efficiently capture performance data for the Target application as a function of its available shared cache. At the same time we use performance counters to monitor the Pirate to ensure that it is successfully stealing the desired amount of cache. To evaluate this approach, we show that 1) the cache available to the Target behaves as expected, 2) the Pirate steals the desired amount of cache, and 3) the Pirate does not impact the Target's performance. As a result, we are able to accurately measure the Target's performance while stealing between 0MB and an average of 6.1MB of the 8MB of cache on our Nehalem-based test system with an average measurement overhead of only 5.5%.
