Adaptive prefetching for shared cache based chip multiprocessors

Kandemir, Mahmut; Zhang, Yuanrui; Öztürk, Özcan

doi:10.1109/date.2009.5090768

Cited by 6 publications

(14 citation statements)

References 23 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Unpredictable memory access streams make extracting benefits from memory prefetching difficult [37,57]. Depending on the access pattern, only 14%-97% of memory bandwidth can actually be utilized [57].…”

Section: Memory Access Streams and Efficiencymentioning

confidence: 99%

“…This is done with local independent hardware prefetch [37] or cache fill streams for a cache-coherent CMP, a list of outstanding load-store requests for a massively multithreaded architecture like a GPU, or via a sequence of DMA requests for a local store architecture like STI Cell [62]. In all of these cases, requests are sent independently over an unpredictable network and thus arrive in nearly random order to memory [76,69].…”

Section: Memory Access Streams and Efficiencymentioning

confidence: 99%

“…Memory prefetching techniques focus on reducing latency and offer little benefit in systems that are bound by memory bandwidth. Prefetching techniques typically perform predictions independently at each processor and thus create out-of-order access patterns [37].…”

Section: Related Workmentioning

confidence: 99%

See 2 more Smart Citations

Collective Memory Transfers for Multi-Core Chips

Williams¹,

Shalf²

2013

View full text Add to dashboard Cite

Section: Memory Access Streams and Efficiencymentioning

confidence: 99%

Section: Memory Access Streams and Efficiencymentioning

confidence: 99%

See 1 more Smart Citation

Collective Memory Transfers for Multi-Core Chips

Williams¹,

Shalf²

2013

View full text Add to dashboard Cite

“…This is done with local independent hardware prefetch [19] or cache fill streams for a cache-coherent CMP, a list of outstanding load-store requests for a massively multithreaded architecture like a GPU, or via a sequence of DMA requests for a local store architecture like STI Cell [34]. In all of these cases, requests are sent independently over an unpredictable network and thus arrive in nearly random order to memory [38,42].…”

Section: Memory Access Streams and Efficiencymentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Collective memory transfers for multi-core chips

Williams

Shalf

2014

Proceedings of the 28th ACM International Conference on Supercomputing

View full text Add to dashboard Cite

Future performance improvements for microprocessors have shifted from clock frequency scaling towards increases in on-chip parallelism. Performance improvements for a wide variety of parallel applications require domain-decomposition of data arrays from a contiguous arrangement in memory to a tiled layout for on-chip L1 data caches and scratchpads. However, DRAM performance suffers under the non-streaming access patterns generated by many independent cores. We propose collective memory scheduling (CMS) that actively takes control of collective memory transfers such that requests arrive in a sequential and predictable fashion to the memory controller. CMS uses the hierarchically tiled arrays formalism to compactly express collective operations, which greatly improves programmability over conventional prefetch or list-DMA approaches. CMS reduces application execution time by up to 32% and DRAM read power by 2.2×, compared to a baseline DMA architecture such as STI Cell.

show abstract

Last Level Collective Hardware Prefetching For Data-Parallel Applications

Shalf

2017

2017 IEEE 24th International Conference on High Performance Computing (HiPC)

View full text Add to dashboard Cite

With rapidly increasing parallelism, DRAM performance and power have surfaced as primary constraints from consumer electronics to high performance computing (HPC) for a variety of applications, including bulk-synchronous dataparallel applications which are key drivers for multi-core, with examples including image processing, climate modeling, physics simulation, gaming, face recognition, and many others. We present the last-level collective prefetcher (LLCP), a purely hardware last-level cache (LLC) prefetcher that exploits the highly correlated prefetch patterns of data-parallel algorithms that would otherwise not be recognized by a prefetcher that is oblivious to data parallelism. LLCP generates prefetches on behalf of multiple cores in memory address order to maximize DRAM efficiency and bandwidth, and can prefetch from multiple memory pages without expensive translations. Compared to well-established other prefetchers, LLCP improves execution time by 5.5% on average (10% maximum), increases DRAM bandwidth by 9% to 18%, decreases DRAM rank energy by 6%, produces 27% more timely prefetches, and increases coverage by 25% at minimum.

show abstract

Adaptive prefetching for shared cache based chip multiprocessors

Cited by 6 publications

References 23 publications

Collective Memory Transfers for Multi-Core Chips

Collective Memory Transfers for Multi-Core Chips

Collective memory transfers for multi-core chips

Last Level Collective Hardware Prefetching For Data-Parallel Applications

Contact Info

Product

Resources

About