RegMutex: Inter-Warp GPU Register Time-Sharing

Khorasani, Farzad; Esfeden, Hodjat Asghari; Farmahini-Farahani, Amin; Jayasena, Nuwan; Sarkar, Vivek

doi:10.1109/isca.2018.00073

Cited by 38 publications

(9 citation statements)

References 28 publications

“…Table 2 presents the major parameters of the simulated system. Except for Section 4.6, all results are generated using the Fermi [24] configuration, which is the mostly targeted configuration for GPU research even in recent publications [9,10,12,16,31,35,41]. Note that although the simulations are based on Fermi architecture, the principles behind EXPARS are also applicable to newer architectures such as Kepler, Maxwell and Pascal.…”

Section: Methods (mentioning)

confidence: 99%

Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory

Bai, Sun, et al. 2018

ACM Trans. Archit. Code Optim.

Modern Graphic Processing Units (GPUs) have become pervasive computing devices in datacenters due to their high performance with massive thread level parallelism (TLP). GPUs are equipped with large register files (RF) to support fast context switch between massive threads and scratchpad memory (SPM) to support inter-thread communication within the cooperative thread array (CTA). However, the TLP of GPUs is usually limited by the inefficient resource management of register file and scratchpad memory. This inefficiency also leads to register file and scratchpad memory underutilization. To overcome the above inefficiency, we propose a new resource management approach EXPARS for GPUs. EXPARS provides a larger register file logically by expanding the register file to scratchpad memory. When the available register file becomes limited, our approach leverages the underutilized scratchpad memory to support additional register allocation. Therefore, more CTAs can be dispatched to SMs, which improves the GPU utilization. Our experiments on representative benchmark suites show that the number of CTAs dispatched to each SM increases by 1.28× on average. In addition, our approach improves the GPU resource utilization significantly, with the register file utilization improved by 11.64% and the scratchpad memory utilization improved by 48.20% on average. With better TLP, our approach achieves 20.01% performance improvement on average with negligible energy overhead.
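The abstract's argument, that register pressure caps the number of resident CTAs while scratchpad sits idle, can be illustrated with a small occupancy calculation. The sketch below is not the EXPARS allocation policy, which the abstract does not detail; it is a simplified model in which a CTA's registers that do not fit in the register file are placed in leftover scratchpad. The function names, the 4-byte register size, and the kernel parameters in the usage lines are assumptions chosen for illustration.

```python
def ctas_baseline(regs_per_cta, spm_per_cta, rf_regs, spm_bytes, max_ctas):
    """CTAs per SM when registers and scratchpad are allocated independently."""
    by_regs = rf_regs // regs_per_cta
    by_spm = spm_bytes // spm_per_cta if spm_per_cta > 0 else max_ctas
    return min(by_regs, by_spm, max_ctas)


def ctas_expanded(regs_per_cta, spm_per_cta, rf_regs, spm_bytes, max_ctas,
                  bytes_per_reg=4):
    """CTAs per SM when register overflow may be placed in unused scratchpad.

    Simplified model (an assumption, not the published mechanism): CTAs are
    admitted one at a time; each first takes its own scratchpad, then as many
    registers as the register file still has, then spills the rest of its
    register allocation into whatever scratchpad remains.
    """
    rf_free, spm_free = rf_regs, spm_bytes
    admitted = 0
    while admitted < max_ctas:
        if spm_free < spm_per_cta:
            break  # not even the CTA's own scratchpad fits
        spm_after = spm_free - spm_per_cta
        in_rf = min(regs_per_cta, rf_free)
        overflow_bytes = (regs_per_cta - in_rf) * bytes_per_reg
        if overflow_bytes > spm_after:
            break  # overflow registers do not fit in leftover scratchpad
        rf_free -= in_rf
        spm_free = spm_after - overflow_bytes
        admitted += 1
    return admitted


# Illustrative SM: 32768 registers, 48 KB scratchpad, at most 8 CTAs.
# Hypothetical kernel: 10240 registers and 2 KB scratchpad per CTA.
base = ctas_baseline(10240, 2048, 32768, 49152, 8)
more = ctas_expanded(10240, 2048, 32768, 49152, 8)
print(base, more)
```

In this hypothetical configuration the register file alone admits three CTAs, and borrowing scratchpad admits a fourth. The model ignores the extra latency of scratchpad-resident registers, which a real design would have to account for.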

“…Efficient register space utilization in GPUs. Works in this section aim to share the physical register file space [25,73]. In Ref.…”

Section: Related Work (mentioning)

confidence: 99%

“…In Ref. [25], the authors propose a software-hardware mechanism named Register Mutual Exclusion (RegMutex) to share a subset of physical registers between warps during the GPU kernel execution. RegMutex increases register utilization by sharing the physical register space (has nothing to do with nearest-neighbor data sharing), while NeDa reuses the physical register space efficiently, along with its corresponding data for a group of SP cores residing in a neighborhood window.…”

Section: Related Work (mentioning)

confidence: 99%
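The statement above describes RegMutex as sharing a subset of physical registers between warps during kernel execution. A toy model of that kind of time-sharing is sketched below, assuming a fixed pool of shared register sets that a warp must hold while it runs a register-hungry region. The class name, the `acquire`/`release` method names, and the pool size are hypothetical and are not taken from the cited text.

```python
class SharedRegisterPool:
    """Toy model: a fixed number of shared register sets time-shared by warps."""

    def __init__(self, num_sets):
        self.free_sets = num_sets
        self.holders = set()  # warp ids currently holding a shared set

    def acquire(self, warp_id):
        """Try to give warp_id a shared set; return False if it must wait."""
        if warp_id in self.holders:
            return True
        if self.free_sets == 0:
            return False
        self.free_sets -= 1
        self.holders.add(warp_id)
        return True

    def release(self, warp_id):
        """Return warp_id's shared set to the pool, if it holds one."""
        if warp_id in self.holders:
            self.holders.remove(warp_id)
            self.free_sets += 1


pool = SharedRegisterPool(num_sets=2)
got = [pool.acquire(w) for w in (0, 1, 2)]  # third warp finds the pool empty
pool.release(0)                             # warp 0 leaves its high-pressure region
retry = pool.acquire(2)                     # warp 2 can now proceed
print(got, retry)
```

The point of the sketch is only that warps whose peak register demand is brief can take turns on the same physical registers instead of each reserving the peak amount for the whole kernel.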

“…CRAT has nothing to do with nearest-neighbor data sharing in GPUs, which is the focus of NeDa architecture. To conclude, prior work [25,73], in register space utilization, aims to share the physical register file space (in the main register file), not the data value, in order to have an efficient register file utilization and increase the thread-level parallelism. However, NeDa aims to share the data values in the local registers (i.e., shared registers are distributed between the SP cores).…”

Section: Related Work (mentioning)

confidence: 99%

Efficient Nearest-Neighbor Data Sharing in GPUs

Nematollahi, Sadrosadati, Falahati, et al. 2020

ACM Trans. Archit. Code Optim.

Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Unit (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches and often incurs performance overheads due to the redundancy in memory accesses. In this article, we propose Neighbor Data (NeDa), a direct nearest-neighbor data sharing mechanism that uses two registers embedded in each streaming processor (SP) that can be accessed by nearest-neighbor SP cores. The registers are compiler-allocated and serve as a data exchange mechanism to eliminate nearest-neighbor shared accesses. NeDa is embedded carefully with local wires between SP cores so as to minimize the impact on density. We place and route NeDa in an open-source GPU and show a small area overhead of 1.3%. The cycle-accurate simulation indicates an average performance improvement of 21.8% and power reduction of up to 18.3% for stencil codes in General-Purpose Graphics Processing Unit (GPGPU) standard benchmark suites. We show that NeDa’s performance is within 13.2% of an ideal GPU with no overhead for nearest-neighbor data exchange.

“…Several works aim to improve the performance of register files. RegMutex [25] improved performance by sharing a subset of physical registers between warps during the GPU kernel execution. FineReg [42] achieved a higher number of concurrent CTAs by partitioning the register file into two regions, one for active CTAs and another for pending CTAs.…”

Section: Related Work (mentioning)

confidence: 99%

CORF

Esfeden, Khorasani, Jeon, et al. 2019

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Syst

Self Cite

The Register File (RF) in GPUs is a critical structure that maintains the state for thousands of threads that support the GPU processing model. The RF organization substantially affects the overall performance and the energy efficiency of a GPU. For example, the frequent accesses to the RF consume a substantial amount of the dynamic energy, and port contention due to limited ports on operand collectors and register file banks affect performance as register operations are serialized. We present CORF, a compiler-assisted Coalescing Operand Register File which performs register coalescing by combining reads to multiple registers required by a single instruction, into a single physical read. To enable register coalescing, CORF utilizes register packing to co-locate narrow-width operands in the same physical register. CORF uses compiler hints to identify which register pairs are commonly accessed together. CORF saves dynamic energy by reducing the number of physical register file accesses, and improves performance by combining read operations, as well as by reducing pressure on the register file. To increase the coalescing opportunities, we re-architect the physical register file to allow coalescing reads across different physical registers that reside in mutually exclusive sub-banks; we call this design CORF++. The compiler analysis for register allocation for CORF++ becomes a form of graph coloring called the bipartite edge frustration problem. CORF++ reduces the dynamic energy of the RF by 17%, and improves IPC by 9%.
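The register-packing idea in this abstract, co-locating narrow-width operands so that one physical read serves two source operands, can be shown with a small read-counting model. This is a sketch under stated assumptions rather than the CORF hardware design: the 16-bit operand width, the low/high half layout, and the `RegisterFile`, `pack16`, and `unpack16` names are all chosen here for illustration.

```python
class RegisterFile:
    """Toy register file of 32-bit words that counts physical reads."""

    def __init__(self, size):
        self.words = [0] * size
        self.reads = 0

    def write(self, idx, value):
        self.words[idx] = value & 0xFFFFFFFF

    def read(self, idx):
        self.reads += 1
        return self.words[idx]


def pack16(lo, hi):
    """Pack two 16-bit operands into one 32-bit word (lo in bits 0..15)."""
    if not (0 <= lo < 1 << 16 and 0 <= hi < 1 << 16):
        raise ValueError("operands must fit in 16 bits")
    return (hi << 16) | lo


def unpack16(word):
    """Split a 32-bit word back into its (lo, hi) 16-bit operands."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF


# Unpacked: the two operands of an add live in separate registers.
rf_plain = RegisterFile(4)
rf_plain.write(0, 7)
rf_plain.write(1, 9)
plain_sum = rf_plain.read(0) + rf_plain.read(1)   # two physical reads

# Packed: both narrow operands share one register, so one read suffices.
rf_packed = RegisterFile(4)
rf_packed.write(2, pack16(7, 9))
a, b = unpack16(rf_packed.read(2))                # one physical read
packed_sum = a + b

print(plain_sum, rf_plain.reads, packed_sum, rf_packed.reads)
```

Both variants compute the same sum, but the packed variant touches the register file once instead of twice, which is the source of the access and energy savings the abstract attributes to coalescing.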
