A compile-time managed multi-level register file hierarchy

Gebhart, Mark; Keckler, Stephen W.; Dally, William J.

doi:10.1145/2155620.2155675

Cited by 62 publications

(43 citation statements)

References 29 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…To improve on the register energy efficiency, we adopt a compiler controlled hierarchical register file implementation [23], [24]. In our design, the operand register file (ORF) contains just 8 entries per thread, while each thread may have 32-256 registers in the main register file (MRF).…”

Section: B Throughput Optimized Core Architecturementioning

confidence: 99%

Scaling the Power Wall: A Path to Exascale

Villa

Johnson

O'Connor³

et al. 2014

SC14: International Conference for High Performance Computing, Networking, Storage and Analysis

Self Cite

100

View full text Add to dashboard Cite

Abstract-Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020 and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide over a 20× improvement in energy efficiency. In this paper, we present some of the progress NVIDIA Research is making toward the design of Exascale systems by tailoring features to address the scaling challenges of performance and energy efficiency. We evaluate several architectural concepts for a set of HPC applications demonstrating expected energy efficiency improvements resulting from circuit and packaging innovations such as low-voltage SRAM, low-energy signaling, and on-package memory. Finally, we discuss the scaling of these features with respect to future process technologies and provide power and performance projections for our Exascale research architecture.

show abstract

Section: B Throughput Optimized Core Architecturementioning

confidence: 99%

Scaling the Power Wall: A Path to Exascale

Villa

Johnson

O'Connor³

et al. 2014

SC14: International Conference for High Performance Computing, Networking, Storage and Analysis

Self Cite

100

View full text Add to dashboard Cite

show abstract

“…Gebhart et al explore several register allocation algorithms and propose a compiler specifiable register file hierarchy that allows sharing of temporary register file resources among running threads, reducing the usage of this energy hogging resource [19], [12]. They also propose a unified scratch, register and primary cache that can be configured at runtime to minimize the access latencies [20].…”

Section: Related Workmentioning

confidence: 99%

An energy efficient GPGPU memory hierarchy with tiny incoherent caches

Sankaranarayanan¹,

Ardestani²,

Briz³

et al. 2013

International Symposium on Low Power Electronics and Design (ISLPED)

View full text Add to dashboard Cite

With progressive generations and the ever-increasing promise of computing power, GPGPUs have been quickly growing in size, and at the same time, energy consumption has become a major bottleneck for them. The first level data cache and the scratchpad memory are critical to the performance of a GPGPU, but they are extremely energy inefficient due to the large number of cores they need to serve. This problem could be mitigated by introducing a cache higher up in hierarchy that services fewer cores, but this introduces cache coherency issues that may become very significant, especially for a GPGPU with hundreds of thousands of in-flight threads.In this paper, we propose adding incoherent tinyCaches between each lane in an SM, and the first level data cache that is currently shared by all the lanes in an SM. In a normal multiprocessor, this would require hardware cache coherence between all the SM lanes capable of handling hundreds of thousands of threads. Our incoherent tinyCache architecture exploits certain unique features of the CUDA/OpenCL programming model to avoid complex coherence schemes. This tinyCache is able to filter out 62% of memory requests that would otherwise need to be serviced by the DL1G, and almost 81% of scratchpad memory requests, allowing us to achieve a 37% energy reduction in the on-chip memory hierarchy. We evaluate the tinyCache for different memory patterns and show that it is beneficial in most cases.

show abstract

“…Gebhart et al [4] proposed register file caching and two-level thread scheduler to reduce the number of reads and writes to the large main register file and save its dynamic energy. The authors further extended their work to the compiler level and explored register allocation algorithms to improve register energy efficiency [5]. Yu et al integrated embedded DRAM and SRAM cells to reduce area and energy [3].…”

Section: Related Workmentioning

confidence: 99%

“…For example, Nvidia Fermi GPU supports more than 20,000 parallel threads and contains 2MB register files [7]. Accessing such sizeable register files leads to massive power consumption [2][3][4][5][6]. It has been reported that the register files consume 15%-20% of the GPU stream multiprocessor's power [8].…”

Section: Introductionmentioning

confidence: 99%

Hybrid CMOS-TFET based register files for energy-efficient GPGPUs

Tan

2013

International Symposium on Quality Electronic Design (ISQED)

View full text Add to dashboard Cite

State-of-the-art General-Purpose computing on Graphics Processing Unit (GPGPU) is facing severe power challenge due to the increasing number of cores placed on a chip with decreasing feature size. In order to hide the long latency operations, GPGPU employs the fine-grained multithreading among numerous active threads, leading to the sizeable register files with massive power consumption. Exploring the optimal power savings in register files becomes the critical and first step towards the energyefficient GPGPUs. The conventional method to reduce dynamic power consumption is the supply voltage scaling, and the inter-bank tunneling FETs (TFETs) are the promising candidates compared to CMOS for low voltage operations regarding to both leakage and performance. However, always executing at the low voltage (so that low frequency) will result in significant performance degradation. In this study, we propose the hybrid CMOS-TFET based register files. To optimize the register power consumption, we allocate TFET-based registers to threads whose execution progress can be delayed to some degree to avoid the memory contentions with other threads, and the CMOS-based registers are still used for threads requiring normal execution speed. Our experimental results show that the proposed technique achieves 30% energy (including both dynamic and leakage) reduction in register files with little performance degradation compared to the baseline case equipped with naive power optimization technique.

show abstract

A compile-time managed multi-level register file hierarchy

Cited by 62 publications

References 29 publications

Scaling the Power Wall: A Path to Exascale

Scaling the Power Wall: A Path to Exascale

An energy efficient GPGPU memory hierarchy with tiny incoherent caches

Hybrid CMOS-TFET based register files for energy-efficient GPGPUs

Contact Info

Product

Resources

About