An implementation of Auto-Memoization mechanism on ARM-based superscalar processor

Summary Function memoization is an optimization technique that reduces function call overhead when the same input appears again: a table that stores previous results is searched, and a hit lets the program skip the repeated computation, which improves the performance of the call. In this article, we propose a software approach to function memoization that improves computing efficiency by bypassing function execution, implemented using approximate computing techniques. Search overhead is a primary concern in any memoization technique proposed so far. In traditional function memoization, the input arguments are first searched in the look‐up table (LUT) for an exact match, and the corresponding result is extracted for further use. In this article, by contrast, we propose a decision‐making rule that decides whether to search the LUT or to perform the actual computation. This decision‐making model is implemented through a Bloom filter and Cantor's pairing function. Because a Bloom filter sometimes produces false‐positive results, we suggest a simple approximation technique that searches the LUT for an approximate match rather than an exact match. The proposed model also contains a bypass algorithm, implemented in C++, that identifies trivial computations from the input arguments of the candidate function, so that the actual calculation can be avoided and the result generated directly. Here, a trivial computation is one in which one or more input arguments are either 0 or ±1. To analyze the effectiveness of the proposed technique, we conducted several experiments using benchmarks from the AxBench suite. We found that our technique outperforms some previously proposed methods in terms of energy consumption and quality of results, particularly in image processing applications.
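The decision rule summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names (`BloomFilter`, `MemoizedMultiply`, `to_natural`), the filter size, the SHA-256-based hashing, and the use of integer multiplication as the candidate function are all assumptions made for illustration. Unlike the summarized technique, this sketch falls back to exact computation on a Bloom filter false positive instead of searching for an approximate match.

```python
import hashlib


def to_natural(n: int) -> int:
    """Map a signed integer to a non-negative one (assumed zigzag encoding),
    because Cantor's pairing function is defined on non-negative integers."""
    return 2 * n if n >= 0 else -2 * n - 1


def cantor_pair(a: int, b: int) -> int:
    """Cantor's pairing function: one unique integer per pair (a, b), a, b >= 0."""
    return (a + b) * (a + b + 1) // 2 + b


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: int):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: int) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: int) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


class MemoizedMultiply:
    """Hypothetical candidate function (a * b) wrapped with the decision rule."""

    def __init__(self):
        self.bloom = BloomFilter()
        self.lut = {}       # look-up table: paired key -> result
        self.computed = 0   # number of actual computations performed
        self.lut_hits = 0   # number of results served from the LUT

    def __call__(self, a: int, b: int) -> int:
        # Bypass: trivial inputs (0 or +/-1) need neither the LUT nor computation.
        if a == 0 or b == 0:
            return 0
        if a == 1:
            return b
        if b == 1:
            return a
        if a == -1:
            return -b
        if b == -1:
            return -a

        key = cantor_pair(to_natural(a), to_natural(b))
        # Decision rule: search the LUT only if the filter says "possibly seen".
        if self.bloom.might_contain(key) and key in self.lut:
            self.lut_hits += 1
            return self.lut[key]

        # Definitely new input, or a false positive: compute and record.
        self.computed += 1
        result = a * b
        self.lut[key] = result
        self.bloom.add(key)
        return result
```

In this sketch, calling `MemoizedMultiply()` twice with the same non-trivial arguments performs one computation and one LUT hit, while calls with an argument of 0 or ±1 touch neither the filter nor the table.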

Section: Prior Work (mentioning)

confidence: 99%

Approximate function memoization

Arundhati, Jena, Pani (2022)

“…DTM is a reuse technique that operates on traces of instructions and is often implemented on top of Von Neumann‐based superscalar architectures, with further studies that include speculative execution. Speculative execution often improves the reuse rate of traces, because it enables reuse based on speculative values for input operands.…”

Section: Related Work (mentioning)

confidence: 99%

“…[15] The size of each operation, i.e., the reuse granularity, can vary from a single instruction [16] to groups of instructions, such as functions, [11] expressions, [17] basic blocks, [18] sub-blocks, [19] or traces. [20] DTM [10] is a reuse technique that operates on traces of instructions and is often implemented on top of Von Neumann-based superscalar architectures, [16][17][18][19][20][21][22][23][24] with further studies that include speculative execution. [25][26][27][28] Speculative execution often improves the reuse rate of traces, because it enables reuse based on speculative values for input operands.…”

Section: Related Work (mentioning)

confidence: 99%

DF‐DTM: Dynamic Task Memoization and reuse in dataflow

Rouberte, Sena, Nery, et al. (2018)

Summary Instruction Reuse is a technique adopted in Von Neumann architectures that improves performance by avoiding redundant execution of instructions when the result to be produced can be obtained by searching an input/output memoization table for such instruction. Trace reuse can be applied to traces of instructions in a similar fashion. However, those techniques are yet to be studied in the context of the Dataflow model, which has been gaining traction in the high performance computing community due to its inherent parallelism. Dataflow programs are represented by directed graphs where nodes are instructions or tasks and edges denote data dependencies between tasks. This work presents Dataflow Dynamic Task Memoization (DF‐DTM), a technique that allows the reuse of both nodes and subgraphs in dataflow, which are analogous to instructions and traces, respectively. The potential of DF‐DTM is evaluated by a series of experiments that analyze the behavior of redundant tasks in five relevant benchmarks, where up to 99.70% of the instantiated tasks could be reused. Moreover, this paper evaluates how reuse rates can be affected by limiting subgraph size, memoization table size, task granularity, and problem size, showing that DF‐DTM can yield good reuse rates in more realistic environments.
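As a rough illustration of node-level reuse in a dataflow graph, the following Python sketch keys a memoization table on a task's operation name and input operands and counts how many task instances are reused. It is a simplified sketch, not the DF‐DTM design: the names `TaskMemoTable` and `execute_graph`, the tuple-based graph encoding, and the requirement that operands be hashable are assumptions, and subgraph (trace-like) reuse is not modeled.

```python
class TaskMemoTable:
    """Memoization table keyed by (operation name, input operands)."""

    def __init__(self):
        self.table = {}
        self.hits = 0     # task instances whose result was reused
        self.misses = 0   # task instances that were actually executed

    def run(self, op_name, func, operands):
        key = (op_name, operands)  # operands must be a hashable tuple
        if key in self.table:
            self.hits += 1
            return self.table[key]
        self.misses += 1
        result = func(*operands)
        self.table[key] = result
        return result

    def reuse_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def execute_graph(nodes, sources, memo):
    """Run a dataflow graph given in topological order.

    nodes:   list of (node_id, op_name, func, input_ids)
    sources: dict mapping source ids to their initial values
    Edges are implied by input_ids: a node fires once its inputs exist.
    """
    values = dict(sources)
    for node_id, op_name, func, input_ids in nodes:
        operands = tuple(values[i] for i in input_ids)
        values[node_id] = memo.run(op_name, func, operands)
    return values


def add(x, y):
    return x + y


def mul(x, y):
    return x * y


# Example graph: n1 and n2 are redundant instances of the same task.
example_nodes = [
    ("n1", "add", add, ("x", "y")),
    ("n2", "add", add, ("x", "y")),
    ("n3", "mul", mul, ("n1", "n2")),
]
example_memo = TaskMemoTable()
example_values = execute_graph(example_nodes, {"x": 3, "y": 4}, example_memo)
```

Here `n2` is served from the table because `n1` already ran `add` on the same operands, so one of the three instantiated tasks is reused.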

“…[22] Moreover, other studies have implemented similar memoization schemes into ARM-based superscalar processors. [23][24][25][26] Some works have also explored the reuse of computation in the GPU domain. [27] For instance, redundant fragment shader executions have been reused on a mobile GPU through hardware memoization.…”

Section: Related Work (mentioning)

confidence: 99%

DTM@GPU: Characterizing and evaluating trace redundancy in GPU

Marzulo, Sena, Nery, et al. (2018)

Summary A program usually contains a significant number of instructions that are repeatedly executed with the same inputs. This redundancy allows the reuse of previous computations, potentially reducing program execution time. The Dynamic Trace Memoization (DTM) technique was proposed to exploit the reuse of dynamic sequences of redundant instructions on superscalar CPUs. This paper proposes applying the DTM technique to a GPU architecture. We propose the DTM@GPU model, which adapts the original DTM technique to the NVIDIA GPU architecture by introducing architectural modifications and identifying different trace reuse styles in multithreaded environments. We investigate reuse opportunities in real‐world GPU applications and the potential performance gains. We also perform a detailed investigation of the characteristics of the reused traces. This characterization shows the number and size of the reused traces, the influence of the cache size on reuse rates, and the cycles that are saved when all threads in a warp reuse instructions or traces. The results show up to approximately 35.3% reuse, yielding an estimated speedup of 10.7%.