All-window profiling and composable models of cache sharing

Xiang, Xiaoya; Bao, Bin; Bai, Tao; Ding, Chen; Chilimbi, Trishul

doi:10.1145/1941553.1941567

Cited by 40 publications

(26 citation statements)

References 28 publications

“…A faster, but less detailed, approach is to only simulate/model parts of the system, and in particular the memory system. Such methods are either trace driven [2,4,3,27] or use high-level data [29,16] similar to the data we use. Finally, the least detailed approach simply aims to identify which applications are sensitive to resource contention [28,17,11].…”

Section: Related Work (mentioning)

confidence: 99%

Modeling performance variation due to cache sharing

Sandberg

Sembrant

Hägersten

et al. 2013

2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)

Shared cache contention can cause significant variability in the performance of co-running applications from run to run. This variability arises from different overlappings of the applications' phases, which can be the result of offsets in application start times or other delays in the system. Understanding this variability is important for generating an accurate view of the expected impact of cache contention. However, variability effects are typically ignored due to the high overhead of modeling or simulating the many executions needed to expose them. This paper introduces a method for efficiently investigating the performance variability due to cache contention. Our method relies on input data captured from native execution of applications running in isolation and a fast, phase-aware, cache sharing performance model. This allows us to assess the performance interactions and bandwidth demands of co-running applications by quickly evaluating hundreds of overlappings. We evaluate our method on a contemporary multicore machine and show that performance and bandwidth demands can vary significantly across runs of the same set of co-running applications. We show that our method can predict application slowdown with an average relative error of 0.41% (maximum 1.8%) as well as bandwidth consumption. Using our method, we can estimate an application pair's performance variation 213× faster, on average, than native execution.

“…There are several models [2,4,3,17] using stack distance traces. Chandra et al [2] pioneered the field with a statistical model that estimates the probability that an access becomes a miss by prolonging its stack distance with the expected number of accesses performed by other applications.…”

Section: Related Work (mentioning)

confidence: 99%

“…This input data consist of the applications' fetch and hit rates, IPCs, and hit ratios as a function of their cache allocation, and can be acquired with low overhead on modern multicore machines [6]. This low-overhead data is in contrast to many existing methods for modeling cache sharing which rely on expensive data such as stack distance traces [2,4,3,17].…”

Section: Introduction (mentioning)

confidence: 99%

Efficient techniques for predicting cache sharing and throughput

Sandberg

Black-Schaffer

Hägersten

2012

Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques

This work addresses the modeling of shared cache contention in multicore systems and its impact on throughput and bandwidth. We develop two simple and fast cache sharing models for accurately predicting shared cache allocations for random and LRU caches. To accomplish this we use low-overhead input data that captures the behavior of applications running on real hardware as a function of their shared cache allocation. This data enables us to determine how much and how aggressively data is reused by an application depending on how much shared cache it receives. From this we can model how applications compete for cache space, their aggregate performance (throughput), and bandwidth. We evaluate our models for two- and four-application workloads in simulation and on modern hardware. On a four-core machine, we demonstrate an average relative fetch ratio error of 6.7% for groups of four applications. We are able to predict workload bandwidth with an average relative error of less than 5.2% and throughput with an average error of less than 1.8%. The model can predict cache size with an average error of 1.3% compared to simulation.

“…snapshots. Three recent papers have solved the problem of measuring the footprint in all execution windows and given a linear-time solution to compute the average footprint [5], [6], [16].…”

Section: Introduction (mentioning)

confidence: 99%
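The quoted statement refers to a linear-time computation of the average footprint over all execution windows. As a rough illustration only (not the cited papers' implementation), the Python sketch below uses a gap-counting identity: a window of length w misses a datum exactly when it fits inside a stretch of the trace where that datum is not accessed, so a gap of length g hides the datum from max(g - w + 1, 0) windows. The function names and the list-based trace representation are assumptions made for this example.

```python
from collections import Counter


def gap_histogram(trace):
    """Histogram of access-free gap lengths, taken over every datum.

    Returns (gaps, m), where gaps[g] is the number of gaps of length g
    and m is the number of distinct data in the trace.
    """
    n = len(trace)
    last_pos = {}
    gaps = Counter()
    for i, x in enumerate(trace):
        prev = last_pos.get(x, -1)      # -1 stands for "before the trace"
        gaps[i - prev - 1] += 1         # gap before first access or between reuses
        last_pos[x] = i
    for p in last_pos.values():
        gaps[n - 1 - p] += 1            # gap after the last access
    return gaps, len(last_pos)


def average_footprint(trace, w):
    """Average number of distinct data over all windows of length w."""
    n = len(trace)
    if not 1 <= w <= n:
        raise ValueError("window length must be between 1 and len(trace)")
    gaps, m = gap_histogram(trace)
    # Total number of (window, datum) pairs where the datum is absent.
    missing = sum(c * (g - w + 1) for g, c in gaps.items() if g >= w)
    return m - missing / (n - w + 1)


def brute_force_footprint(trace, w):
    """Direct definition, quadratic cost; used only as a cross-check."""
    windows = [trace[i:i + w] for i in range(len(trace) - w + 1)]
    return sum(len(set(win)) for win in windows) / len(windows)
```

For the trace `list("aab")` and w = 2, the two windows `aa` and `ab` hold 1 and 2 distinct data, and both functions return 1.5. One pass builds the histogram; evaluating every window length in linear total time would additionally need suffix sums over the histogram, which this sketch omits.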

“…the additional misses due to sharing, can be computed from single-program statistics. This is known as the composable model because it uses a linear number of sequential tests to predict the performance of an exponential number of parallel co-runs [5]. In shared cache, the reuse distance in thread A is lengthened by the footprint of thread B.…”

Section: Introduction (mentioning)

confidence: 99%
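The statement above says that, in a shared cache, a reuse distance in thread A is lengthened by the footprint of thread B. A minimal sketch of that idea follows, under assumptions that are not taken from the cited papers: a fully associative LRU cache measured in data items, both threads issuing accesses at the same rate (so a reuse spanning t accesses of A overlaps t accesses of B), and a brute-force footprint helper chosen for brevity rather than speed. All function names are hypothetical.

```python
def reuse_pairs(trace):
    """Yield (stack_distance, reuse_time) for every reuse in the trace.

    stack_distance counts the distinct data touched since the previous
    access to the same datum, including the datum itself.
    """
    last = {}
    for i, x in enumerate(trace):
        if x in last:
            p = last[x]
            yield len(set(trace[p:i])), i - p
        last[x] = i


def avg_footprint(trace, w):
    """Average distinct data per window of length w (brute force)."""
    w = min(w, len(trace))              # clamp windows longer than the trace
    starts = range(len(trace) - w + 1)
    return sum(len(set(trace[i:i + w])) for i in starts) / len(starts)


def predicted_misses(trace_a, trace_b, cache_size):
    """Misses predicted for A when it shares an LRU cache with B.

    A reuse in A is counted as a miss when its own stack distance plus
    B's average footprint over the same time span exceeds the cache size.
    Assumes equal access rates for A and B (an assumption of this sketch).
    """
    cold = len(set(trace_a))            # first accesses always miss
    capacity = sum(
        1
        for dist, time in reuse_pairs(trace_a)
        if dist + avg_footprint(trace_b, time) > cache_size
    )
    return cold + capacity
```

With `trace_a = list("abab")` and a two-item cache, `predicted_misses(trace_a, [], 2)` gives 2 (cold misses only), while adding the peer `list("xyxy")` gives 4, because each reuse in A is stretched past the cache size by B's footprint of 2. Only single-program statistics enter the prediction, which is what makes the model composable across co-run pairings.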

Cache Conscious Task Regrouping on Multicore Processors

Xiang

Bao

Ding

et al. 2012

2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Ccgrid 2012)

Self Cite

Because of the interference in the shared cache on multicore processors, the performance of a program can be severely affected by its co-running programs. If job scheduling does not consider how a group of tasks utilize cache, the performance may degrade significantly, and the degradation usually varies sizably and unpredictably from run to run. In this paper, we use trace-based program locality analysis and make it efficient enough for dynamic use. We show a complete on-line system for periodically measuring the parallel execution, predicting and ranking cache interference for all co-run choices, and reorganizing programs based on the prediction. We test our system on floating-point and mixed integer and floating-point workloads composed of SPEC 2006 benchmarks and compare with the default Linux job scheduler to show the benefit of the new system in improving performance and reducing performance variation.
