Reducing the precision of floating-point values can improve performance and/or reduce energy expenditure in computer graphics, among other applications. However, reducing the precision level of floating-point values in a controlled fashion needs support at both the compiler and the microarchitecture level. At the compiler level, a method is needed to automate the reduction of precision of each floating-point value. At the microarchitecture level, a lower precision of each floating-point register can allow more floating-point values to be packed into a register file. This, however, calls for new register file organizations.

This article proposes an automated precision-selection method and a novel GPU register file organization that can densely store floating-point register values at arbitrary precisions. The automated precision-selection method uses a data-driven approach to set the precision level of floating-point values, given a quality threshold and a representative set of input data. By allowing a small, but acceptable, degradation in output quality, our method can remove a significant fraction of the bits needed to represent floating-point values in the investigated kernels (between 28% and 60%). Our proposed register file organization exploits these lower-precision floating-point values by packing several of them into the same physical register. This reduces the register pressure per thread by up to 48%, and by 27% on average, for a negligible output-quality degradation. This can enable GPUs to keep up to twice as many threads in flight simultaneously.

A. Angerd et al.

Narrowing the width of floating-point values is an effective approach to achieve both higher performance [6] and higher energy efficiency [8, 16, 22], especially for GPUs, which now support 16-bit floating-point standards [14].
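To make the two core ideas above concrete, the following is a minimal sketch, not the article's actual mechanism: truncating mantissa bits of a 32-bit float to lower its precision, and packing two reduced-precision values into one 32-bit physical register. The function names are illustrative, and the fixed 16-bit layout (sign, 8-bit exponent, 7-bit mantissa, as in bfloat16) is a simplifying assumption; the proposed register file supports arbitrary precisions.

```python
import struct

def truncate_mantissa(x: float, kept_bits: int) -> float:
    """Reduce precision by zeroing the low (23 - kept_bits) mantissa
    bits of a binary32 value (truncation, not round-to-nearest)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - kept_bits
    bits &= ~((1 << drop) - 1)          # clear the dropped mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_half(x: float) -> int:
    """Keep only the top 16 bits of a binary32 value (bfloat16-style)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def from_half(h: int) -> float:
    """Expand a 16-bit reduced value back to binary32."""
    return struct.unpack("<f", struct.pack("<I", h << 16))[0]

def pack2(a: float, b: float) -> int:
    """Pack two reduced-precision values into one 32-bit 'register'."""
    return (to_half(a) << 16) | to_half(b)

def unpack2(reg: int) -> tuple:
    """Recover both reduced-precision values from the packed register."""
    return from_half((reg >> 16) & 0xFFFF), from_half(reg & 0xFFFF)
```

For values exactly representable in the reduced format (e.g., 1.5 or -2.0), the pack/unpack round trip is lossless; otherwise the truncation introduces a bounded relative error, which is exactly the kind of controlled quality degradation the precision-selection method trades for register savings.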
A substantially narrower width of floating-point values can open up many novel optimization approaches at the hardware level, such as more resource-efficient register files, data paths, functional units, and cache memory subsystems. However, to leverage such optimizations, two issues must be addressed. First, the width of each and every floating-point value must be established at the instruction level. Second, architectural support is needed to use the established widths to utilize register file, data path, functional unit, or cache resources more efficiently. The goal of this article is to provide such a framework.

Programming language models that enable approximate computing, such as EnerJ [20] and FlexJava [15], take a binary view and declare a variable as either approximable or precise. Hence, they cannot deal with an arbitrary width of floating-point variables. Even if there were support for specifying precision, it would be laborious or nearly impossible for programmers to use it efficiently. It would also need support at the instruction-set-architecture level, such as in Quora [24], to specify error bounds at the instruction level.

Precimonious [18] provides a framework to automatically select among...