Toggle-Aware Compression for GPUs

Pekhimenko, Gennady; Bolotin, Evgeny; O’Connor, Mike; Mutlu, Onur; Mowry, Todd C.; Keckler, Stephen W.

doi:10.1109/lca.2015.2430853

Cited by 13 publications

(8 citation statements)

References 29 publications


“…Such schemes would need to decide when to compress data depending upon the potential increase in latency compared to the reduction in toggle rate and crosstalk. This is similar to the work by Pekhimenko et al [29], but taking into account the effects of crosstalk and using a model derived from realistic data.…”

Section: Discussion (supporting)

confidence: 72%

Measuring and modeling on-chip interconnect power on real hardware

Adhinarayanan, Paul, Greathouse, et al. 2016

2016 IEEE International Symposium on Workload Characterization (IISWC)


Abstract: On-chip data movement is a major source of power consumption in modern processors, and future technology nodes will exacerbate this problem. Properly understanding the power that applications expend moving data is vital for inventing mitigation strategies. Previous studies combined data movement energy, which is required to move information across the chip, with data access energy, which is used to read or write on-chip memories. This combination can hide the severity of the problem, as memories and interconnects will scale differently to future technology nodes. Thus, increasing the fidelity of our energy measurements is of paramount concern. We propose to use physical data movement distance as a mechanism for separating movement energy from access energy. We then use this mechanism to design microbenchmarks to ascertain data movement energy on a real modern processor. Using these microbenchmarks, we study the following parameters that affect interconnect power: (i) distance, (ii) interconnect bandwidth, (iii) toggle rate, and (iv) voltage and frequency. We conduct our study on an AMD GPU built in 28 nm technology and validate our results against industrial estimates for energy/bit/millimeter. We then construct an empirical model based on our characterization and use it to evaluate the interconnect power of 22 real-world applications. We show that up to 14% of the dynamic power in some applications can be consumed by the interconnect and present a range of mitigation strategies.
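To make the "toggle rate" parameter in the abstract above concrete, here is a minimal Python sketch that counts bit transitions between consecutive words sent over a bus. It is an illustration only: the function names, the 32-bit bus width, and the `slack` threshold in the decision helper are assumptions made for this example, not the measurement method or the compression policy of any of the cited papers.

```python
def count_toggles(flits, width_bits=32):
    """Count bit transitions between consecutive words (flits) on a bus.

    Each wire toggles when its bit differs from the previously sent word,
    so the toggle count is the popcount of the XOR of adjacent words.
    """
    mask = (1 << width_bits) - 1
    toggles = 0
    for prev, cur in zip(flits, flits[1:]):
        toggles += bin((prev ^ cur) & mask).count("1")
    return toggles


def toggle_rate(flits, width_bits=32):
    """Fraction of wire-cycles in which a wire changes value."""
    if len(flits) < 2:
        return 0.0
    return count_toggles(flits, width_bits) / ((len(flits) - 1) * width_bits)


def prefer_compressed(raw_flits, compressed_flits, slack=1.0, width_bits=32):
    """Hypothetical decision rule (an assumption for illustration):
    send the compressed form only if it does not increase the total
    toggle count by more than the given slack factor."""
    raw = count_toggles(raw_flits, width_bits)
    comp = count_toggles(compressed_flits, width_bits)
    return comp <= slack * raw


if __name__ == "__main__":
    words = [0x0, 0x1, 0x2, 0x3]
    print(count_toggles(words))   # toggles across the three transitions
    print(toggle_rate(words))     # toggles per wire per transfer
```

A real controller would also weigh the bandwidth saved by compression and any added latency, as the citation statement above notes; this sketch compares toggle counts only.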



“…To enable LCP, ETC employs an additional 512-entry metadata cache inside the memory controller to accelerate compression metadata lookup and thus reduce the performance overhead of the LCP framework. Once the application classification logic determines that the executing application is 1) a regular application with data sharing or 2) an irregular application, ETC begins the capacity compression process by storing all data written to the GPU memory using the base-delta-immediate compression algorithm [73], which is simple to implement and effective [70][71][72][73][98]. Figure 9 shows the design overview of ETC, which consists of Application Classification, Proactive Eviction, Memory-aware Throttling, and memory Capacity Compression.…”

Section: Capacity Compression (mentioning)

confidence: 99%
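The base-delta-immediate idea mentioned in the statement above can be sketched as follows: store one base value plus small per-word deltas when all words in a block lie close to the base. This is a deliberately simplified illustration. It assumes a single base (the first word), one fixed delta width, and little-endian 4-byte words; the function names are hypothetical, and the published algorithm tries several base and delta sizes, which this sketch does not.

```python
def bdi_compress(block, word_size=4, delta_size=1):
    """Try to encode a block as (base, deltas).

    Returns None when the block is empty, not word-aligned, or when any
    word's delta from the base does not fit in a signed delta_size-byte
    integer (the block would then be stored uncompressed).
    """
    if not block or len(block) % word_size != 0:
        return None
    words = [
        int.from_bytes(block[i:i + word_size], "little")
        for i in range(0, len(block), word_size)
    ]
    base = words[0]  # simplification: the first word is the only base tried
    lo = -(1 << (8 * delta_size - 1))
    hi = (1 << (8 * delta_size - 1)) - 1
    deltas = [w - base for w in words]
    if all(lo <= d <= hi for d in deltas):
        return base, deltas
    return None


def bdi_decompress(base, deltas, word_size=4):
    """Rebuild the original block by adding each delta back to the base."""
    return b"".join((base + d).to_bytes(word_size, "little") for d in deltas)


def compressed_size(deltas, word_size=4, delta_size=1):
    """Payload bytes for one base plus one delta per word (metadata excluded)."""
    return word_size + len(deltas) * delta_size


if __name__ == "__main__":
    # Four nearby 32-bit values, e.g. pointers into the same region.
    block = b"".join(w.to_bytes(4, "little") for w in (0x1000, 0x1004, 0x1008, 0x0FFC))
    base, deltas = bdi_compress(block)
    assert bdi_decompress(base, deltas) == block
    print(len(block), "->", compressed_size(deltas), "bytes")
```

Under these assumptions a 16-byte block of nearby values shrinks to a 4-byte base plus four 1-byte deltas, while a block with widely spread values is left uncompressed.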

“…Memory Compression in GPUs. Several works study memory and cache compression in GPUs [49,70,71,79,87,99]. These works show benefits due to on-chip and off-chip memory bandwidth savings.…”

Section: Related Work (mentioning)

confidence: 99%

A Framework for Memory Oversubscription Management in Graphics Processing Units

Ausavarungnirun, Rossbach, et al. 2019

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Syst

Self Cite


Modern discrete GPUs support unified memory and demand paging. Automatic management of data movement between CPU memory and GPU memory dramatically reduces developer effort. However, when application working sets exceed physical memory capacity, the resulting data movement can cause great performance loss. This paper proposes a memory management framework, called ETC, that transparently improves GPU performance under memory oversubscription using new techniques to overlap eviction latency of GPU pages, reduce thrashing cost, and increase effective memory capacity. Eviction latency can be hidden by eagerly creating space for demand-paged data with proactive eviction (E). Thrashing costs can be ameliorated with memory-aware throttling (T), which dynamically reduces the GPU parallelism when page fault frequencies become high. Capacity compression (C) can enable larger working sets without increasing physical memory capacity. No single technique fits all workloads, and, thus, ETC integrates proactive eviction, memory-aware throttling and capacity compression into a principled framework that dynamically selects the most effective combination of techniques, transparently to the running software. To this end, ETC categorizes applications into three categories: regular applications without data sharing across kernels, regular applications with data sharing across kernels, and irregular applications. Our evaluation shows that ETC fully mitigates the oversubscription overhead for regular applications without data sharing and delivers performance similar to the ideal unlimited GPU memory baseline. We also show that ETC outperforms the state-of-the-art baseline by 60.4% and


“…Data compression is a technique that exploits the redundancy in the applications' data to reduce capacity and bandwidth requirements for many modern systems by saving and transmitting data in a more compact form. Hardware-based data compression has been explored in the context of on-chip caches [4,11,25,33,49,87,89,99,118], interconnect [30], and main memory [2,37,88,90,91,104,114] as a means to save storage capacity as well as memory bandwidth. In modern GPUs, memory bandwidth is a key limiter to system performance in many workloads (Section 3).…”

Section: A Case for CABA: Data Compression (mentioning)

confidence: 99%

“…Compression. Several prior works [6,11,88,89,90,91,100,104,114] study memory and cache compression with several different compression algorithms [4,11,25,49,87,118], in the context of CPUs or GPUs.…”

Section: Related Work (mentioning)

confidence: 99%

A framework for accelerating bottlenecks in GPU execution with assist warps

Vijaykumar, Pekhimenko, Jog, et al. 2017

Advances in GPU Research and Practice


Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications. We believe that CABA is a flexible framework that enables the use of idle resources to improve application performance with different optimizations and perform other useful tasks. We discuss how CABA can be used, for example, for memoization, prefetching, handling interrupts, profiling, redundant multithreading, and speculative precomputation.

