Improving cache performance by selective cache bypass

Abstract: Modern CPUs operate at GHz frequencies, but memory access latencies remain relatively large, on the order of hundreds of cycles. Deeper cache hierarchies with larger caches can mask these latencies for codes with good data locality and reuse, such as structured dense matrix computations. However, cache hierarchies do not necessarily benefit sparse scientific computing codes, which tend to have limited data locality and reuse. We therefore propose a new memory architecture with a Load Miss Predictor (LMP), which includes a data bypass cache and a predictor table, to reduce access latencies by determining whether a load should bypass the main cache hierarchy and issue an early load to main memory. Our architecture uses the L2 (and lower caches) as a victim cache for data removed from our bypass cache. Using cycle-accurate simulations with SimpleScalar and Wattch, we show that our LMP improves the performance of sparse codes, our application domain of interest, by 14% on average, with a 13.6% increase in power. When the LMP is used with dynamic voltage and frequency scaling (DVFS), performance can be improved by 8.7% with system power savings of 7.3% and an energy reduction of 17.3% at 1800 MHz relative to the base system at 2000 MHz. Alternatively, our LMP can be used to improve the performance of SPEC benchmarks by an average of 2.9% at the cost of a 7.1% increase in average power.
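
The abstract does not say how the predictor table is organized. As a rough illustration only, the sketch below assumes a table of 2-bit saturating counters indexed by the load's program counter: a load whose counter has reached a threshold is predicted to miss and would be sent early to main memory. The class name, table size, threshold, and example addresses are hypothetical and are not taken from the paper.

```python
class LoadMissPredictor:
    """Toy PC-indexed miss predictor (an assumed design, not the paper's)."""

    def __init__(self, entries=1024, threshold=2):
        self.entries = entries          # assumed table size
        self.threshold = threshold      # counter value at which a miss is predicted
        self.counters = [0] * entries   # 2-bit saturating counters, initially "hit"

    def _index(self, pc):
        # Simple modulo hash of the load's program counter.
        return pc % self.entries

    def predict_bypass(self, pc):
        """Return True if the load at `pc` is predicted to miss in the caches."""
        return self.counters[self._index(pc)] >= self.threshold

    def update(self, pc, missed):
        """Train the counter with the observed outcome of the load."""
        i = self._index(pc)
        if missed:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


lmp = LoadMissPredictor()
streaming_pc = 0x400A10                  # hypothetical load that keeps missing
for _ in range(2):
    lmp.update(streaming_pc, missed=True)
print(lmp.predict_bypass(streaming_pc))  # True after two observed misses
print(lmp.predict_bypass(0x400B20))      # False: an untrained entry predicts a hit
```

A real design would also have to handle aliasing between loads that share a table entry and decide what happens on a wrong prediction; the sketch ignores both.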

“…The importance of reducing memory access latencies is reflected in a rich set of earlier results towards faster loads [26]-[29].…”

Section: Related Research (mentioning)

confidence: 99%

“…Software cache bypassing schemes were discussed by Chi [29]. Energy savings for scientific applications were considered by Choi, et al [4] and Freeh, et al [30].…”

Section: Related Research (mentioning)

confidence: 99%

Load Miss Prediction - Exploiting Power Performance Trade-offs

Malkowski, Link, Raghavan, et al., 2007

2007 IEEE International Parallel and Distributed Processing Symposium

“…Chi and Dietz [6] present an early work on selective cache bypassing; they use compiler support. Etsion et al [8] point out that if a resident block in a cache is chosen at random, it is unlikely to be a highly-referenced block, but if an access is chosen at random, it is likely to be to a highly-referenced block.…”

Section: Related Work (mentioning)

confidence: 99%

Probabilistic Directed Writebacks for Exclusive Caches

Olson, Hill, 2016

SIGARCH Comput. Archit. News
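
The observation attributed to Etsion et al. in the statement above can be checked with a small worked example. The reference counts below are hypothetical, the cutoff for "highly referenced" is an assumption, and every block is treated as equally likely to be the resident block that gets picked.

```python
# Hypothetical reference counts: 10 "hot" blocks with 100 references each
# and 90 "cold" blocks with a single reference each.
ref_counts = [100] * 10 + [1] * 90
HOT = 50  # assumed cutoff for "highly referenced"

hot_blocks = [c for c in ref_counts if c >= HOT]

# Chance that a uniformly chosen block is hot.
p_block = len(hot_blocks) / len(ref_counts)
# Chance that a uniformly chosen access touches a hot block.
p_access = sum(hot_blocks) / sum(ref_counts)

print(f"random block is hot:  {p_block:.1%}")   # 10.0%
print(f"random access is hot: {p_access:.1%}")  # 91.7%
```

With these numbers only 1 block in 10 is hot, yet 1000 of the 1090 accesses go to hot blocks, which is the asymmetry the quoted statement describes.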

“…Based on how to predict distant reuse blocks, these studies can be classified into PC based [5,8,33] and address based [13,15,30,31]. LRF [37] combines PC based and address based methods to improve performance.…”

Section: Related Work (mentioning)

confidence: 99%

Optimal bypass monitor for high performance last-level caches

Tong, Xie, et al., 2012

Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques

In the last-level cache, a large number of blocks have reuse distances greater than the available cache capacity. Cache performance and efficiency can be improved if some subset of these distant-reuse blocks can reside in the cache longer. The bypass technique is an effective and attractive solution that prevents the insertion of harmful blocks. Our analysis shows that bypass can contribute significant performance improvement, and that the optimal bypass can achieve performance similar to OPT+B, which is the theoretical optimal replacement policy. Thus, we propose a bypass technique called Optimal Bypass Monitor (OBM), which makes bypass decisions by learning and predicting the behavior of the optimal bypass. OBM keeps a short global track of the incoming-victim block pairs. By detecting the first reuse block in each pair, the behavior of the optimal bypass on the track can be asserted to guide the bypass choice. Any existing replacement policy can be extended with OBM while requiring negligible design modification. Our experimental results show that, using less than 1.5KB of extra memory, OBM with the NRU replacement policy outperforms LRU by 9.7% and 8.9% for single-thread and multiprogrammed workloads, respectively. Compared with other state-of-the-art proposals such as DRRIP and SDBP, it achieves superior performance with less storage overhead.
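
The incoming-victim tracking described in this abstract might be pictured roughly as follows. This is a sketch under stated assumptions, not OBM itself: the abstract does not say how observed outcomes are turned into decisions, so a single global saturating counter is assumed here, and the class name, track length, and counter width are all hypothetical.

```python
from collections import deque


class PairMonitor:
    """Toy monitor of (incoming, victim) pairs; an assumed design, not OBM's."""

    def __init__(self, track_len=8, max_count=7):
        self.track = deque(maxlen=track_len)  # most recent unresolved pairs
        self.max_count = max_count
        self.counter = max_count // 2         # higher values favor bypassing

    def record_miss(self, incoming, victim):
        """Remember which block was inserted and which block it evicted."""
        self.track.append((incoming, victim))

    def observe_access(self, addr):
        """Resolve any tracked pair whose first reused block is `addr`."""
        remaining = deque(maxlen=self.track.maxlen)
        for incoming, victim in self.track:
            if addr == victim:
                # Victim reused first: bypassing the incoming block was better.
                self.counter = min(self.max_count, self.counter + 1)
            elif addr == incoming:
                # Incoming block reused first: inserting it was the right call.
                self.counter = max(0, self.counter - 1)
            else:
                remaining.append((incoming, victim))
        self.track = remaining

    def should_bypass(self):
        return self.counter > self.max_count // 2


mon = PairMonitor()
mon.record_miss(incoming=0xA0, victim=0xB0)
mon.observe_access(0xB0)     # the evicted block is needed again first
print(mon.should_bypass())   # True
```

The point of the sketch is only the pairing idea: each miss creates one (incoming, victim) pair, and whichever block of the pair is touched first indicates whether bypassing would have been the better choice for that miss.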

Cited by 32 publications

References 5 publications
