2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI: 10.1109/micro.2016.7783744

Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation

Cited by 24 publications (26 citation statements)
References 30 publications

“…Reducing the precision of floating-point [6,21,42] and fixed-point [22] numbers has been used to alleviate the memory bandwidth bottleneck in deep neural networks [22], GPU workloads [42], and other approximation-tolerant applications [21], thereby improving performance and energy efficiency. However, the compression ratio remains limited to between 2:1 and 4:1 despite the loss of precision, as these approaches do not exploit inter-value similarities to compress data.…”
Section: Related Work (mentioning)
confidence: 99%
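To make the precision-reduction idea concrete, the following minimal C sketch (illustrative only, not the mechanism of any cited paper) keeps just the upper 16 bits of each 32-bit float, namely the sign, exponent, and top 7 mantissa bits, giving a fixed 2:1 lossy compression of the kind the citation describes.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Keep only the upper 16 bits of an IEEE-754 single-precision value:
 * 1 sign bit, 8 exponent bits, and the top 7 mantissa bits.
 * This is a fixed 2:1 lossy compression (bfloat16-style truncation). */
static uint16_t truncate_fp32(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* type-pun without aliasing issues */
    return (uint16_t)(bits >> 16);
}

/* Expand back; the discarded mantissa bits are zero-filled, so the
 * result is only an approximation of the original value. */
static float expand_fp32(uint16_t half) {
    uint32_t bits = (uint32_t)half << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    float original = 3.14159265f;
    uint16_t concise = truncate_fp32(original);   /* 2 bytes instead of 4 */
    printf("original %.7f, approximate %.7f\n", original, expand_fp32(concise));
    return 0;
}

Because every value shrinks by the same fixed factor regardless of its content, the ratio is capped at 2:1 (or 4:1 with more aggressive truncation), which is exactly the limitation the citation points out.
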
“…Similar to most techniques that focus on data approximation [21,36,39], AVR assumes that the programmer annotates memory regions that can be approximated and hence compressed in a lossy manner. This annotation also includes the size of the region as well as the datatype of the approximable data.…”
Section: Memory Blocks (mentioning)
confidence: 99%
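As a rough sketch of what such a region annotation could look like at the source level, the C fragment below records a region's base address, size, and element datatype in a small table; the function name approx_annotate_region, the enum, and the table are invented for illustration and are not AVR's actual interface.

#include <stddef.h>
#include <stdio.h>

/* Illustrative element types for approximable data (hypothetical, not AVR's API). */
enum approx_dtype { APPROX_FP32, APPROX_FP64, APPROX_FIXED16 };

struct approx_region {
    void *base;                 /* start of the approximable region   */
    size_t size;                /* size of the region in bytes        */
    enum approx_dtype dtype;    /* datatype of the values it contains */
};

#define MAX_REGIONS 16
static struct approx_region regions[MAX_REGIONS];
static int num_regions;

/* Hypothetical annotation call: declare that [base, base + size) holds
 * approximable values of the given type, so loads and stores to it may
 * be served from a lossily compressed representation. */
static void approx_annotate_region(void *base, size_t size, enum approx_dtype dtype) {
    if (num_regions < MAX_REGIONS)
        regions[num_regions++] = (struct approx_region){ base, size, dtype };
}

int main(void) {
    static float weights[1024];  /* e.g., an error-tolerant weight buffer */
    approx_annotate_region(weights, sizeof weights, APPROX_FP32);
    printf("annotated %d approximable region(s)\n", num_regions);
    return 0;
}
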
“…Register-width annotations can be used to enable optimizations to, for instance, functional units (e.g., SIMD-style parallelism [14]), cache systems [6], bandwidth utilization [21], and register file organizations [12]. The register file is of particular interest in GPU architectures.…”
Section: Motivation (mentioning)
confidence: 99%
“…The work by Jain et al. [6] also investigates the mantissa-truncation format, but in the context of optimizations in the CPU memory hierarchy. Since they target the memory hierarchy, their approach is orthogonal to ours.…”
Section: Related Work (mentioning)
confidence: 99%