2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)
DOI: 10.1109/isca.2018.00024
Get Out of the Valley: Power-Efficient Address Mapping for GPUs

Abstract: GPU memory systems adopt a multi-dimensional hardware structure to provide the bandwidth necessary to support 100s to 1000s of concurrent threads. On the software side, GPU-compute workloads also use multi-dimensional structures to organize the threads. We observe that these structures can combine unfavorably and create significant resource imbalance in the memory subsystem, causing low performance and poor power-efficiency. The key issue is that it is highly application-dependent which memory address bits exhib…

Cited by 28 publications (37 citation statements)
References 41 publications
“…We further assume two LLC slices per channel, and a total number of 64 LLC slices or 16 LLC slices per HBM stack. We use the state-of-the-art PAE randomized address mapping scheme to uniformly distribute memory accesses across LLC slices, memory channels, and banks [42]. We further assume a typical cache line size of 128 B.…”
Section: Methods
“…The memory accesses of the SMs are routed to the different LLC slices and MCs based on the memory address. In this work, we use the recently proposed PAE address mapping which evenly distributes memory requests to different addresses across LLC slices and memory controllers to maximize parallelism in the memory subsystem [42].…”
Section: Background and Motivation
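The goal the citing papers attribute to PAE, spreading memory accesses evenly across LLC slices, channels, and banks, can be illustrated with a simple XOR-folding hash. This is a minimal sketch under stated assumptions: the channel count, bit positions, and the 128 B line offset are illustrative choices, not the paper's exact PAE mapping.

```python
# Sketch: fold higher address bits into the channel index via XOR so that
# power-of-two strides spread across channels instead of colliding.
# All constants below are assumptions for illustration.

NUM_CHANNELS = 8       # assumed channel count
CHANNEL_SHIFT = 7      # assumed: channel bits sit just above a 128 B line offset
CHANNEL_BITS = 3       # log2(NUM_CHANNELS)

def channel_of(addr: int) -> int:
    """XOR two higher bit groups into the plain channel-index bits."""
    idx = (addr >> CHANNEL_SHIFT) & (NUM_CHANNELS - 1)
    idx ^= (addr >> (CHANNEL_SHIFT + CHANNEL_BITS)) & (NUM_CHANNELS - 1)
    idx ^= (addr >> (CHANNEL_SHIFT + 2 * CHANNEL_BITS)) & (NUM_CHANNELS - 1)
    return idx

# A 1 KiB stride maps every access to channel 0 under the plain modulo
# mapping, but the XOR fold spreads the same stream over all 8 channels.
channels = [channel_of(i * 1024) for i in range(8)]
```

The design point this illustrates: XORing in higher address bits makes the channel index depend on bits that vary under large strides, which is what breaks the pathological "all requests to one channel" case.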
“…The low-order row bits are then XORed with the bank bits to generate new bank bits. The authors of [2] proposed a binary invertible matrix (BIM) for GPU mapping (Figure 2f), which represents memory remapping operations. The BIM composes all address mapping schemes through AND and XOR operations, and exploits its reversibility property to ensure that all possible correspondences are considered.…”
Section: Address Mapping
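The binary invertible matrix (BIM) idea described above can be sketched as a GF(2) matrix-vector product: each remapped address bit is the XOR reduction of an AND between one matrix row and the input address bits. The 4-bit matrix below is a hand-picked invertible toy example for illustration, not the mapping from the cited work.

```python
# Sketch of a BIM-style remapping over a tiny 4-bit address space.
# Each row is a bitmask; output bit i is the GF(2) inner product
# (AND then XOR-reduce) of row i with the address. The matrix is an
# assumed example, chosen by hand to be invertible over GF(2).

ADDR_BITS = 4
BIM = [0b1000,   # out bit 0 <- addr bit 3
       0b1100,   # out bit 1 <- addr bit 3 XOR addr bit 2
       0b0010,   # out bit 2 <- addr bit 1
       0b0011]   # out bit 3 <- addr bit 1 XOR addr bit 0

def parity(x: int) -> int:
    """XOR-reduce the bits of x (population-count parity)."""
    return bin(x).count("1") & 1

def remap(addr: int) -> int:
    out = 0
    for i, row in enumerate(BIM):
        out |= parity(row & addr) << i
    return out
```

Because the matrix is invertible over GF(2), the remapping is a bijection on the address space: every remapped address is unique, so no two physical locations collide, which is the reversibility property the citing text highlights.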
“…The nine differences are stored in Dif_ram. Among these values, the difference stored in Dif_ram[2] and Dif_ram[5] (000027F0) is equal to M (10224D) output by the PCU, the difference stored in Dif_ram[8] is FFFFAFF8 (<0), and the remaining entries Dif_ram[0], Dif_ram[1], Dif_ram[3], Dif_ram[4], Dif_ram[6], and Dif_ram[7] are all equal. The output terminal S0 of the AND gate is 1, the output terminals S1 and S2 are 0, and the AL output is 100, denoting that the memory access address stream follows the 2D memory access pattern.…”
Section: Arbitration Logic (AL)
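The arbitration logic above classifies an address stream by comparing consecutive address differences. A simplified software analogue of that idea, with assumed names and classification rules rather than the paper's exact circuit, looks like this:

```python
# Sketch: classify an address stream from its consecutive differences.
# A 1D (constant-stride) walk has one distinct difference; a row-major
# 2D walk has a repeated small stride plus a periodic row-jump stride.
# The labels and rules here are illustrative assumptions.

def classify(addrs):
    diffs = [b - a for a, b in zip(addrs, addrs[1:])]
    if len(set(diffs)) == 1:
        return "1D"                      # constant stride throughout
    if len(set(diffs)) == 2:
        small, big = sorted(set(diffs))
        if diffs.count(big) < diffs.count(small):
            return "2D"                  # frequent small stride, rare row jump
    return "irregular"

# Row-major walk over a 4x3 tile inside a wider array (row pitch 16):
addrs = [row * 16 + col for row in range(4) for col in range(3)]
# classify(addrs) returns "2D"
```

The hardware version performs the same comparisons in parallel with comparators and gates (the S0/S1/S2 outputs in the quoted text); the software sketch only shows the underlying difference test.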