2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2017.58

Architecting an Energy-Efficient DRAM System for GPUs

Cited by 77 publications (28 citation statements). References 29 publications.
“…In our work, we use an in-cache computing architecture similar to BLADE, targeting the L1 cache of ARM-based many-core systems, as opposed to the last-level cache (LLC) targeted by NeuralCache, proposed by Eckert et al. (2018). Emerging memory architectures such as HBM, proposed by Lee et al. (2014), have also been explored, but mainly for GPUs, as discussed by Chatterjee et al. (2017). To the best of our knowledge, this is the first work that simulates in-cache acceleration together with HBM at the system level in Linux-based systems.…”
Section: Related Work
confidence: 99%
“…SJF dynamically trades off the latency of completing all memory requests of a warp against the bandwidth-utilization benefits of FR-FCFS. Chatterjee et al. [10] propose a static reordering scheme to reduce the toggling rate in DRAM. Scheduling is orthogonal to address mapping because it attempts to increase row-buffer hit rates, while address mapping attempts to distribute memory requests evenly across channels and banks.…”
Section: Related Work
confidence: 99%
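The scheduling policy named above can be made concrete with a small sketch. The following Python model is a minimal, hypothetical illustration (names like Bank and fr_fcfs_pick are ours, not from the cited papers) of the two core FR-FCFS rules: serve the oldest row-buffer hit if one exists, otherwise the oldest request. It shows why scheduling exploits the currently open row regardless of how addresses were mapped to channels and banks.

from collections import deque

class Bank:
    """Minimal DRAM bank model: tracks only the currently open row."""
    def __init__(self):
        self.open_row = None

def fr_fcfs_pick(queue, bank):
    """Simplified FR-FCFS: serve the oldest request that hits the open
    row (a row-buffer hit); if none hits, serve the oldest request."""
    for req in queue:                  # queue is ordered oldest-first
        if req["row"] == bank.open_row:
            return req                 # first-ready: row-buffer hit wins
    return queue[0]                    # first-come, first-served fallback

# Usage: row 7 is open, so request 1 is served ahead of the older request 0.
bank = Bank()
bank.open_row = 7
queue = deque([{"id": 0, "row": 3}, {"id": 1, "row": 7}, {"id": 2, "row": 3}])
print(fr_fcfs_pick(queue, bank))   # {'id': 1, 'row': 7}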
“…Thus, low-entropy address bits should be mapped to rows, to exploit row-buffer locality, and high-entropy bits should be mapped to channels and banks, to exploit parallelism. Address mapping schemes have previously been proposed for single-core CPUs [5], multi-core CPUs [7], and GPUs [4], [8], [9], [10]. Our objective is to systematically analyze the entropy of the concurrent memory addresses in GPU-compute workloads and use this insight to derive efficient address-mapping policies.…”
Section: Introduction
confidence: 99%
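The entropy argument above lends itself to a short worked example. The sketch below is a simplified illustration under our own assumptions (a whole-trace analysis, not the authors' tool, which reasons about concurrent addresses): it computes the Shannon entropy of each address bit over a trace. Bits whose entropy is near 0 rarely change and suit the row index; bits whose entropy is near 1 toggle often and suit channel and bank selection.

import math

def bit_entropy(addresses, num_bits=32):
    """Per-bit Shannon entropy over an address trace."""
    n = len(addresses)
    entropies = []
    for b in range(num_bits):
        p = sum((addr >> b) & 1 for addr in addresses) / n
        if p in (0.0, 1.0):
            entropies.append(0.0)  # constant bit: no information, map to row
        else:
            entropies.append(-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return entropies

# Usage: a strided trace from two hypothetical base addresses. The low
# six bits (the cache-line offset) never change; the stride bits toggle.
trace = [base + 64 * i for base in (0x10000, 0x20000) for i in range(256)]
print([round(h, 2) for h in bit_entropy(trace)[:16]])
# bits 0-5 -> 0.0 (map to row); bits 6-13 -> 1.0 (map to channel/bank)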