2014 Second International Symposium on Computing and Networking (CANDAR 2014)
DOI: 10.1109/candar.2014.49
Hinting for Auto-Memoization Processor Based on Static Binary Analysis

Cited by 4 publications (5 citation statements). References 5 publications.

“…DTM is a reuse technique that operates on traces of instructions and is often implemented on top of Von Neumann-based superscalar architectures, with further studies that include speculative execution. Speculative execution often improves the reuse rate of traces, because it enables reuse based on speculative values for input operands.…”
Section: Related Work
confidence: 99%
“…[15] The size of each operation, ie, the reuse granularity, can vary from a single instruction [16] to groups of instructions, such as functions [11], expressions [17], basic blocks [18], sub-blocks [19], or traces [20]. DTM [10] is a reuse technique that operates on traces of instructions and is often implemented on top of Von Neumann-based superscalar architectures [16-24], with further studies that include speculative execution [25-28]. Speculative execution often improves the reuse rate of traces, because it enables reuse based on speculative values for input operands.…”
Section: Related Work
confidence: 99%
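As a rough illustration of the reuse idea described in these excerpts, the sketch below shows an instruction-level reuse table in C, keyed on the operation and its input operand values; a matching entry lets the result be returned without re-executing the computation. This is a minimal sketch under assumed parameters (table size, hash function, two-operand format), not the mechanism of DTM or of any cited design.

    #include <stdint.h>
    #include <stdbool.h>

    #define REUSE_ENTRIES 1024            /* arbitrary table size (assumption) */

    struct reuse_entry {
        bool     valid;
        uint32_t opcode;                  /* which operation produced the result */
        uint64_t in1, in2;                /* input operand values */
        uint64_t result;                  /* memoized output */
    };

    static struct reuse_entry table[REUSE_ENTRIES];

    static unsigned index_of(uint32_t opcode, uint64_t in1, uint64_t in2)
    {
        /* simple hash of the lookup key; real designs differ */
        return (unsigned)((opcode ^ in1 ^ (in2 << 1)) % REUSE_ENTRIES);
    }

    /* Returns true and fills *out if an identical computation was seen before. */
    bool reuse_lookup(uint32_t opcode, uint64_t in1, uint64_t in2, uint64_t *out)
    {
        struct reuse_entry *e = &table[index_of(opcode, in1, in2)];
        if (e->valid && e->opcode == opcode && e->in1 == in1 && e->in2 == in2) {
            *out = e->result;             /* hit: skip the actual execution */
            return true;
        }
        return false;
    }

    /* Called after a normal execution to record the result for future reuse. */
    void reuse_update(uint32_t opcode, uint64_t in1, uint64_t in2, uint64_t result)
    {
        struct reuse_entry *e = &table[index_of(opcode, in1, in2)];
        e->valid  = true;
        e->opcode = opcode;
        e->in1    = in1;
        e->in2    = in2;
        e->result = result;
    }

Trace-level schemes such as DTM apply the same kind of lookup to the combined inputs and outputs of a whole instruction sequence, and the speculative variants mentioned above probe the table with predicted operand values before those values are confirmed.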
“…[22] Moreover, other studies have implemented similar memoization schemes into ARM-based superscalar processors [23-26]. Some works have also explored the reuse of computation in the GPU domain [27]. For instance, redundant fragment shader executions have been reused on a mobile GPU through hardware memoization.…”
Section: Related Work
confidence: 99%
“…Since it is difficult to track these global changes at runtime, existing hardware memoization approaches apply to single instructions or blocks of instructions [139,64,66,82,49,50,91,145], whereas function level memoization has only been exploited in software based solutions [132,84,110,168]. Our memoization scheme is different as it is function level and hardware based, since our work is focused on GPUs and graphical applications where it is easier to track changes to global data and no mutable state or side-effects exist.…”
Section: Eliminating Redundant Fragment Shader Executions
confidence: 99%
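For contrast with the hardware schemes discussed above, the sketch below shows what a purely software, function-level memoization wrapper can look like in C. The function name expensive() and the direct-mapped cache size are assumptions for illustration; the wrapper is only safe when the function has no side effects and reads no mutable global state, which is exactly the tracking difficulty the excerpt points out.

    #include <stdint.h>
    #include <stdbool.h>

    #define CACHE_SLOTS 256               /* arbitrary cache size (assumption) */

    /* Hypothetical pure function: same argument, same result, no side effects. */
    extern uint64_t expensive(uint64_t x);

    struct memo_slot {
        bool     valid;
        uint64_t arg;
        uint64_t result;
    };

    static struct memo_slot cache[CACHE_SLOTS];

    uint64_t expensive_memoized(uint64_t x)
    {
        struct memo_slot *s = &cache[x % CACHE_SLOTS];
        if (s->valid && s->arg == x)
            return s->result;             /* reuse the earlier computation */
        s->valid  = true;
        s->arg    = x;
        s->result = expensive(x);         /* compute once, remember the result */
        return s->result;
    }

If expensive() did read mutable global data, every write to that data would have to invalidate the cached entries; that runtime-tracking problem is what pushes the hardware approaches cited above toward single instructions or small blocks.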
“…As pointed out in [134], this concept is important because computation re-use is lucrative only when the cost of accessing the structures used for memoization is smaller than the benefit of skipping the actual computation. For this reason prior work on memoization either tries to perform memoization for multiple instructions [66,82,49,50,91,145] or for long latency operations [64].…”
Section: Task-level Complexity
confidence: 99%
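The trade-off in this last excerpt can be made concrete with a simple back-of-the-envelope model (the linear cost model and variable names below are assumptions, not taken from the cited works): if a lookup costs T_lookup cycles on every attempt, the skipped computation costs T_compute cycles, and a fraction h of attempts hit, then memoization pays off when T_lookup + (1 - h) * T_compute < T_compute, that is, when T_lookup < h * T_compute.

    #include <stdbool.h>

    /* Average cost per operation when a reuse table is probed on every
     * execution and misses fall back to the full computation
     * (a simplified linear model, assumed for illustration). */
    static double avg_cost_with_reuse(double t_lookup, double t_compute,
                                      double hit_rate)
    {
        return t_lookup + (1.0 - hit_rate) * t_compute;
    }

    /* Reuse is worthwhile when the average cost with the table is lower
     * than always computing, i.e. when t_lookup < hit_rate * t_compute. */
    static bool reuse_is_worthwhile(double t_lookup, double t_compute,
                                    double hit_rate)
    {
        return avg_cost_with_reuse(t_lookup, t_compute, hit_rate) < t_compute;
    }

For example, a 3-cycle lookup in front of a 20-cycle computation only pays off if more than 15% of the attempts hit, which is consistent with the excerpt's point that prior work targets groups of instructions or long-latency operations rather than cheap single instructions.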