Perceptron Based Consumer Prediction in Shared-Memory Multiprocessors

Leventhal, Sean; Franklin, Mark A.

doi:10.1109/iccd.2006.4380808

Cited by 9 publications

(4 citation statements)

References 23 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Instruction-based predictors [52], [53] are proposed as an alternative to normal address-based predictors. There are also some coherence predictors based on perceptron [54]. All of these optimizations are costly and may require some major changes to a processor design.…”

Section: Related Workmentioning

confidence: 99%

Invalidate or Update? Revisiting Coherence for Tomorrow's Cache Hierarchies

Zhu

Shahab

Katsarakis

et al. 2021

2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT)

View full text Add to dashboard Cite

Shared on-chip last-level caches (LLCs) play a key role in capturing the large working sets of today's dataintensive workloads. However, they pose a fundamental scalability challenge in the transistor-limited post-Moore regime. Recent work has argued for Next-Generation LLCs (NG-LLC) based on private caches in die-stacked DRAM, which can provide hundreds of MBs of per-core LLC capacity at similar access latency to today's shared LLCs. While NG-LLCs offer a number of advantages, their private design exposes longlatency inter-core reads for read/write shared data, which hurt performance in parallel workloads. One way to eliminate the long latency of reads to read/write shared data is through the use of updating coherence protocols that eagerly push updates from a writer core into caches of recent readers. Alas, these protocols are known to generate excess cache and interconnect traffic that can be detrimental to overall performance. While hybrid protocols that try to alleviate the problem by combining invalidating and updating protocols have been proposed, we find their performance benefit to be small for NG-LLCs.This work observes that the number of writes to a read/write shared cache block is likely to be stable over several consecutive write/read iterations. Based on this insight, we propose the 1-Update protocol that records the number of writes without an intervening read by a sharer, and subsequently uses the recorded value to send at most one update after that number of writes has taken place. We have formally verified 1-Update and show that it achieves high efficacy in covering remote misses for read/write shared cache blocks while minimizing excess cache and interconnect traffic.

show abstract

Section: Related Workmentioning

confidence: 99%

Invalidate or Update? Revisiting Coherence for Tomorrow's Cache Hierarchies

Zhu

Shahab

Katsarakis

et al. 2021

2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT)

View full text Add to dashboard Cite

show abstract

“…The perceptron is almost the simplest possible neural networks, which enables hardware implementation. Because of its simplicity, the perceptron technique is used later for several other systems problems [22], [42]. LSTM is a recurrent neural networks that is getting popular in the application in systems problems.…”

Section: Related Workmentioning

confidence: 99%

Learning Forward Reuse Distance

Li,

2020

Preprint

View full text Add to dashboard Cite

Caching techniques are widely used in the era of cloud computing from applications, such as Web caches to infrastructures, Memcached and memory caches in computer architectures. Prediction of cached data can greatly help improve cache management and performance. The recent advancement of deep learning techniques enables the design of novel intelligent cache replacement policies.In this work, we propose a learning-aided approach to predict future data accesses. We find that a powerful LSTM-based recurrent neural network model can provide high prediction accuracy based on only a cache trace as input. The high accuracy results from a carefully crafted locality-driven feature design. Inspired by the high prediction accuracy, we propose a pseudo OPT policy and evaluate it upon 13 real-world storage workloads from Microsoft Research. Results demonstrate that the new cache policy improves the state-of-art practical policies by up to 19.2% and incurs only 2.3% higher miss ratio than OPT on average.

show abstract

“…In contrast, there are several methods in the literature to address the miss rate concern for off‐chip multi‐processor systems [3, 4, 11–13]. Recent off‐chip researches reduce miss in sets of cache blocks that are accessed in consistent streams [4, 13].…”

Section: Related Workmentioning

confidence: 99%

“…Some of the other existing off‐chip methods moderate the number of remote misses by updating invalidated blocks. To avoid generating excessive update messages to nodes that no longer wish to consume the data, the consumer set has to be predicted [11, 12].…”

Section: Related Workmentioning

confidence: 99%

Tree‐based scheme for reducing shared cache miss rate leveraging regional, statistical and temporal similarities

Lenjani¹,

Hashemi²

2014

IET Computers & Digital Techniques

View full text Add to dashboard Cite

Cache miss can have a major impact on overall performance of many-core systems. A miss may result in extra traffic and delay because of coherency messages. This has been reduced in coarse-grain coherency protocols where only shared misses require a coherency message. Conventional off-chip methods manage the shared miss rate by relying on reuse histories. However the pertinent memory overhead that comes with reuse histories makes them impractical for on-chip multi-processor systems. In this study, a new scheme has been proposed to reduce shared cache miss rate in multi-processor system-on-chips that benefits from novel prefetching techniques to L2 caches from off-chip memories or other remote L2 caches located on-chip. In the proposed scheme, the previously proposed Virtual Tree Coherence (VTC) method has been extended to limit block forwarding messages to true sharers within each region. Instead of relying on exact reuse histories, shared regions are searched for regional, temporal and statistical similarities. These similarities are exploited for determining the sharers that should receive the forwarded blocks. The proposed method has been evaluated with Splash-2 workloads. Simulation results indicate that the proposed method has reduced shared miss count by up to 75%, and improved interconnect traffic by up to 47% compared with VTC.

show abstract

Perceptron Based Consumer Prediction in Shared-Memory Multiprocessors

Cited by 9 publications

References 23 publications

Invalidate or Update? Revisiting Coherence for Tomorrow's Cache Hierarchies

Invalidate or Update? Revisiting Coherence for Tomorrow's Cache Hierarchies

Learning Forward Reuse Distance

Tree‐based scheme for reducing shared cache miss rate leveraging regional, statistical and temporal similarities

Contact Info

Product

Resources

About