Reducing the precision of floating-point values can improve performance and/or reduce energy expenditure in computer graphics, among other applications. However, reducing the precision level of floating-point values in a controlled fashion needs support at both the compiler and the microarchitecture level. At the compiler level, a method is needed to automate the reduction of precision of each floating-point value. At the microarchitecture level, a lower precision of each floating-point register can allow more floating-point values to be packed into a register file. This, however, calls for new register file organizations.

This article proposes an automated precision-selection method and a novel GPU register file organization that can densely store floating-point register values at arbitrary precisions. The automated precision-selection method uses a data-driven approach to set the precision level of floating-point values, given a quality threshold and a representative set of input data. By allowing a small, but acceptable, degradation in output quality, our method can remove a significant fraction of the bits needed to represent floating-point values in the investigated kernels (between 28% and 60%). Our proposed register file organization exploits these lower-precision floating-point values by packing several of them into the same physical register. This reduces the register pressure per thread by up to 48%, and by 27% on average, for a negligible output-quality degradation. This can enable GPUs to keep up to twice as many threads in flight simultaneously.

A. Angerd et al.

Narrowing the width of floating-point values is an effective approach to achieve both higher performance [6] and higher energy efficiency [8, 16, 22], especially for GPUs, which now support 16-bit floating-point standards [14].
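To make the two core ideas above concrete, the following is a minimal sketch, not the article's actual mechanism: truncating mantissa bits of a 32-bit float to lower its precision, and packing two reduced-precision values into one 32-bit physical register. The function names are illustrative, and the fixed 16-bit layout (sign, 8-bit exponent, 7-bit mantissa, as in bfloat16) is a simplifying assumption; the proposed register file supports arbitrary precisions.

```python
import struct

def truncate_mantissa(x: float, kept_bits: int) -> float:
    """Reduce precision by zeroing the low (23 - kept_bits) mantissa
    bits of a binary32 value (truncation, not round-to-nearest)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - kept_bits
    bits &= ~((1 << drop) - 1)          # clear the dropped mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_half(x: float) -> int:
    """Keep only the top 16 bits of a binary32 value (bfloat16-style)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def from_half(h: int) -> float:
    """Expand a 16-bit reduced value back to binary32."""
    return struct.unpack("<f", struct.pack("<I", h << 16))[0]

def pack2(a: float, b: float) -> int:
    """Pack two reduced-precision values into one 32-bit 'register'."""
    return (to_half(a) << 16) | to_half(b)

def unpack2(reg: int) -> tuple:
    """Recover both reduced-precision values from the packed register."""
    return from_half((reg >> 16) & 0xFFFF), from_half(reg & 0xFFFF)
```

For values exactly representable in the reduced format (e.g., 1.5 or -2.0), the pack/unpack round trip is lossless; otherwise the truncation introduces a bounded relative error, which is exactly the kind of controlled quality degradation the precision-selection method trades for register savings.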
A substantially narrower width of floating-point values can open up many novel optimization approaches at the hardware level, such as more resource-efficient register files, data paths, functional units, and cache memory subsystems. However, to leverage such optimizations, two issues must be addressed. First, the width of each and every floating-point value must be established at the instruction level. Second, architectural support is needed to use the established widths to utilize register file, data path, functional unit, or cache resources more efficiently. The goal of this article is to provide such a framework.

Programming language models that enable approximate computing, such as EnerJ [20] and FlexJava [15], take a binary view and declare a variable as either approximable or precise. Hence, they cannot deal with an arbitrary width of floating-point variables. Even if there were support for specifying precision, it would be laborious or nearly impossible for programmers to use it efficiently. It would also need support at the instruction-set-architecture level, such as in Quora [24], to specify error bounds at the instruction level.

Precimonious [18] provides a framework to automatically select among...