2008
DOI: 10.1145/1328195.1328200

Reducing cache misses through programmable decoders

Abstract: Level-one caches normally reside on a processor's critical path, which determines clock frequency. Therefore, fast access to the level-one cache is important. Direct-mapped caches exhibit faster access times, but poorer hit rates, than same-sized set-associative caches because of nonuniform accesses to the cache sets. The nonuniform accesses generate more cache misses in some sets, while other sets are underutilized. We propose to increase the decoder length and, hence, reduce the accesses to heavily used se…
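
As a rough illustration of the problem the abstract describes (not the paper's programmable-decoder design), the following Python sketch models a small direct-mapped cache and counts per-set accesses and misses. The set count, block size, and address stream are made-up values chosen only to show how aliasing addresses concentrate conflict misses in a few sets while other sets sit idle.

from collections import Counter

NUM_SETS = 8           # illustrative; real L1 caches have far more sets
BLOCK_BYTES = 64       # a common cache-line size (assumed here)

def set_index(addr):
    # Conventional direct-mapped indexing: low-order bits of the block address.
    return (addr // BLOCK_BYTES) % NUM_SETS

def simulate(addresses):
    resident_tag = {}                  # set index -> tag currently resident
    accesses, misses = Counter(), Counter()
    for addr in addresses:
        idx = set_index(addr)
        tag = addr // (BLOCK_BYTES * NUM_SETS)
        accesses[idx] += 1
        if resident_tag.get(idx) != tag:
            misses[idx] += 1           # cold or conflict miss in this set
            resident_tag[idx] = tag
    return accesses, misses

# Two regions 0x2000 bytes apart alias to the same sets, creating hot sets.
stream = [base + i * BLOCK_BYTES for i in range(4) for base in (0x0000, 0x2000)]
accesses, misses = simulate(stream * 4)
for s in range(NUM_SETS):
    print(f"set {s}: accesses={accesses[s]:2d}  misses={misses[s]:2d}")

With this stream, sets 0-3 absorb every reference and miss each time while sets 4-7 are never touched; this kind of imbalance is what the abstract's longer, programmable decoder index is meant to smooth out.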

Cited by 8 publications (3 citation statements); references 45 publications.
“…One disadvantage of this method is high per-access power consumption. An enhanced B-Cache [85] reduced the total access energy consumption; however, it remained higher than that of a lower associativity cache. The FASTA-based VMWA cache enables almost complete conflict miss elimination, and at the same time, significantly reduces the cache access energy consumption compared to a typical way-associative cache.…”
Section: Related Work
confidence: 84%
“…Due to its simple structure, the snoopy protocol is considered more advantageous than other protocols. However, since the system bus is an exclusive resource, the efficiency of the protocol drops dramatically when the number of processor cores interconnected by the system bus is large [4]. In the snoopy protocol, all requests are broadcast onto the system bus in an undifferentiated manner, so every processor connected to the bus must read each request and check whether its cache contains a copy of the requested data block.…”
Section: Snoopy Protocol
confidence: 99%
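
The statement above describes the core behaviour of a snoopy protocol: every request is broadcast on the shared bus and every attached cache must examine it. A minimal sketch of that broadcast-and-check step, with hypothetical class and method names not taken from the cited work, might look like this:

class SnoopyCache:
    def __init__(self, name):
        self.name = name
        self.blocks = set()            # block addresses currently cached

    def snoop(self, block):
        # Every cache attached to the bus must inspect every broadcast request.
        hit = block in self.blocks
        print(f"{self.name}: snooped block {block:#x} -> "
              f"{'copy held' if hit else 'no copy'}")
        return hit

class SystemBus:
    def __init__(self, caches):
        self.caches = caches           # all cores share this single, exclusive bus

    def broadcast_read(self, requester, block):
        # Broadcast to every other cache; forcing all of them to snoop each
        # request is what limits scalability as the core count grows.
        hits = [c.snoop(block) for c in self.caches if c is not requester]
        return any(hits)

caches = [SnoopyCache(f"core{i}") for i in range(4)]
caches[1].blocks.add(0x40)             # core1 happens to hold block 0x40
bus = SystemBus(caches)
print("another core holds a copy:", bus.broadcast_read(caches[0], 0x40))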
“…There has been significant work on runtime effects due to cache performance; however, most of this research focuses on minimizing cache misses [1,2,8,17,18,19]. By minimizing cache misses, energy spent in accessing memory is decreased, and the overall application runtime is improved.…”
Section: Related Work
confidence: 99%