A dynamically tunable memory hierarchy

Balasubramonian, Rajeev; Albonesi, David H.; Buyuktosunoglu, Alper; Dwarkadas, Sandhya

doi:10.1109/tc.2003.1234523

Cited by 31 publications

(38 citation statements)

References 30 publications

“…Prior work on reconfigurable caches has been restricted to a single 2D die and to relatively small caches [3,34,47]. Some prior work [19,33,43] logically splits large cache capacity across cores at run-time and can be viewed as a form of reconfiguration.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…Many proposals of 2D reconfigurable caches already exist in the literature: they allow low access times for small cache sizes but provide the flexibility to incorporate larger capacities at longer access times. The use of 3D and NUCA makes the design of a reconfigurable cache especially attractive: (i) the spare capacity on the third die does not intrude with the layout of the second die, nor does it steal capacity from other neighboring caches (as is commonly done in 2D reconfigurable caches [3,47]), (ii) since the cache is already partitioned into NUCA banks, the introduction of additional banks and delays does not greatly complicate the control logic, (iii) the use of a third dimension allows access time to grow less than linearly with capacity (another disadvantage of a 2D reconfigurable cache).…”

Section: Reconfigurable SRAM/DRAM Cache (mentioning)

confidence: 99%

Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy

Madan, Zhao, Muralimanohar, et al. 2009

2009 IEEE 15th International Symposium on High Performance Computer Architecture

“…Reducing power consumption. Because constructive cache sharing reduces the amount of cache needed by multithreaded programs (by up to a factor of P ), it provides new opportunities to power down segments of the cache [25,4,41]. Consider, for example, a cache architecture that supports eight 1 MB on-chip caches that can be powered on or off as needed.…”

Section: Constructive Sharing Is Critical for CMPs (mentioning)

confidence: 99%

“…In contrast, Mergesort is not bounded by memory bandwidth and its performance improves with more cores. 4 When making a design choice in the CMP design space, a typical goal is to optimize the performance of a suite of benchmark applications (e.g. SPEC) measured by aggregate performance metrics.…”

Section: Default Configurations: PDF vs. WS (mentioning)

confidence: 99%

Scheduling threads for constructive cache sharing on CMPs

Chen, Gibbons, Kozuch, et al. 2007

Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures

In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3-1.6X performance improvement relative to WS for several fine-grained parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.

“…Methods to change the size and associativity of a cache hierarchy dynamically have been explored [2,29] and the ability to disable various levels of a multi-level cache in the interest of reducing latency and reducing power consumption has been considered [5]. Finally, adjusting the size of cache lines dynamically to lower the cache miss rate has been considered [31].…”

Section: Related Work (mentioning)

confidence: 99%

Superoptimization of memory subsystems

Wingbermuehle, Cytron, Chamberlain 2014

Proceedings of the 2014 SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems

The disparity in performance between processors and main memories has led computer architects to incorporate large cache hierarchies in modern computers. Because these cache hierarchies are designed to be general-purpose, they may not provide the best possible performance for a given application. In this paper, we determine a memory subsystem well suited for a given application and main memory by discovering a memory subsystem comprised of caches, scratchpads, and other components that are combined to provide better performance. We draw motivation from the superoptimization of instruction sequences, which successfully finds unusually clever instruction sequences for programs. Targeting both ASIC and FPGA devices, we show that it is possible to discover unusual memory subsystems that provide performance improvements over a typical memory subsystem.
