FORAY-GEN: Automatic Generation of Affine Functions for Memory Optimizations

Issenin, Ilya; Dutt, Nikil

doi:10.1109/date.2005.157

Cited by 18 publications

(18 citation statements)

References 12 publications


“…We translate the source code into the FORAY format [24], which essentially consists of just the loop structure and the array access functions as affine functions of the loop iterators. We analyze the code in this format, and, perform our page-aware array interleaving transformations in this format, and then convert it back to the source code.…”

Section: Experiments and Results (mentioning)

confidence: 99%

Code Transformations for TLB Power Reduction

Jeyapaul

Shrivastava

2010

Int J Parallel Prog


The Translation Look-aside Buffer (TLB) is a very important part in the hardware support for virtual memory management implementation of high performance embedded systems. The TLB, though small, is frequently accessed, and therefore not only consumes significant energy, but also is one of the important thermal hot-spots in the processor. Recently, several circuit and microarchitectural implementations of TLBs have been proposed to reduce TLB power. One simple, yet effective TLB design for power reduction is the Use-Last TLB architecture proposed in IEEE J Solid State Circuits, 1190-1199, (2004). The Use-Last TLB architecture reduces the power consumption when the last page is accessed again. In this work, we develop code transformation techniques to reduce the page switchings in data cache accesses and propose an efficient page-aware code placement technique to enhance the energy reduction capabilities achieved by the Use-Last TLB architecture for instruction cache accesses. Our comprehensive page switch reduction algorithm results in an average of 39% reduction in the data-TLB page switching, and our code placement heuristic results in an average of 76% reduction in the instruction-TLB page switchings with negligible impact on the performance on benchmarks from MiBench, Multimedia, DSPStone and BDTI suites. The reduced page switch count through our techniques achieves an equivalent power savings, above and beyond the reduction achieved by the Use-Last TLB architecture implementation.
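The notion of a "page switch" in the abstract above can be illustrated with a small sketch. It is not the authors' algorithm: the function name `count_page_switches`, the 4096-byte page size, and the two array layouts are assumptions chosen for illustration. The sketch counts how often consecutive accesses land on different pages, first for two arrays stored on separate pages and then for the same accesses with the arrays interleaved in memory.

```python
def count_page_switches(addresses, page_size=4096):
    """Count how often consecutive accesses fall on different pages."""
    switches = 0
    last_page = None
    for addr in addresses:
        page = addr // page_size
        if last_page is not None and page != last_page:
            switches += 1
        last_page = page
    return switches


N = 8          # elements per array (hypothetical)
ELEM = 4       # bytes per element (hypothetical)

# Layout 1: A starts at address 0, B starts one page later.
# The loop body reads A[i] then B[i], so accesses alternate between pages.
separate = []
for i in range(N):
    separate += [0 + ELEM * i, 4096 + ELEM * i]

# Layout 2: A and B interleaved element by element in one region,
# so A[i] and B[i] are adjacent and share a page.
interleaved = []
for i in range(N):
    interleaved += [2 * ELEM * i, 2 * ELEM * i + ELEM]

switches_separate = count_page_switches(separate)
switches_interleaved = count_page_switches(interleaved)
```

Under these assumed layouts, the alternating accesses to separate pages switch page on almost every access, while the interleaved layout keeps all sixteen accesses on one page.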


“…For most audio and video processing and multimedia program, this is the case. However, if there are accesses to the array that are expressed using pointers, such a program can be converted to the required form by using the FORAY technique [Issenin 2005]. Regarding dealing with conditional statements, if an array reference is executed conditionally and the condition is also an affine function of outer loop iterators, such a case is handled by the proposed approach (see Step 1-2-2 later in this section).…”

Section: Drdu: Data Reuse Analysis Algorithm and Its Use For Memory S… (mentioning)

confidence: 99%

Drdu

Issenin

Brockmeyer

Miranda

et al. 2007

ACM Trans. Des. Autom. Electron. Syst.


In multimedia and other streaming applications, a significant portion of energy is spent on data transfers. Exploiting data reuse opportunities in the application, we can reduce this energy by making copies of frequently used data in a small local memory and replacing speed- and power-inefficient transfers from main off-chip memory by more efficient local data transfers. In this article we present an automated approach for analyzing these opportunities in a program that allows modification of the program to use custom scratch-pad memory configurations comprising a hierarchical set of buffers for local storage of frequently reused data. Using our approach we are able to both reduce energy consumption of the memory subsystem when using a scratch-pad memory by about a factor of two, on average, and improve memory system performance compared to a cache of the same size.
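The citation statements describe the FORAY form as a loop structure plus array accesses written as affine functions of the loop iterators, and note that pointer-based accesses can be converted to that form. The sketch below only illustrates that idea and is not the FORAY-GEN tool itself: the helper names `affine_index`, `affine_trace`, and `pointer_trace`, and the 3x4 traversal with a row stride of 10, are hypothetical. It shows a pointer-style walk and an affine access function producing the same sequence of array indices.

```python
from itertools import product


def affine_index(coeffs, const, iters):
    """Return const + sum(coeffs[k] * iters[k]), an affine function of the iterators."""
    return const + sum(c * i for c, i in zip(coeffs, iters))


def affine_trace(bounds, coeffs, const):
    """Enumerate a rectangular loop nest (outermost iterator first) and
    evaluate the affine access function at every iteration."""
    nest = product(*(range(b) for b in bounds))
    return [affine_index(coeffs, const, it) for it in nest]


def pointer_trace(rows, cols, row_stride):
    """Mimic pointer-style code: a row pointer advanced by row_stride and an
    element pointer incremented by one inside the inner loop."""
    trace = []
    p = 0
    for _ in range(rows):
        q = p
        for _ in range(cols):
            trace.append(q)   # corresponds to dereferencing *q
            q += 1
        p += row_stride
    return trace


# Pointer walk over a 3x4 window of a row-major array with 10 elements per row.
ptr = pointer_trace(rows=3, cols=4, row_stride=10)

# The same accesses as an affine function of iterators (i, j): index = 10*i + j.
aff = affine_trace(bounds=(3, 4), coeffs=(10, 1), const=0)
```

Once the access is in the `10*i + j` form, analyses that reason about loop bounds and coefficients (such as the data reuse and page-aware transformations mentioned above) no longer need to follow the pointer arithmetic.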


“…A Pareto-optimum trade-off curve between execution time and resource usage is shown in Figure 4. Resource usage is obtained by taking the larger of the proportions of block RAM and slice usage [19] as seen in (1). Note that each point on the graph represents a fully placed and routed design.…”

Section: Methods (mentioning)

confidence: 99%

“…Much work has been done in the development of scratchpad memories (SPM) [1,2,3] for algorithms with static memory access patterns. However, algorithms such as the Huffman decoder and some motion vector estimation approaches [4] exhibit data dependent memory access patterns, and as a result, the memory accesses cannot be predicted at compile time.…”

Section: Introduction (mentioning)

confidence: 99%

A Flexible Multi-port Caching Scheme for Reconfigurable Platforms

Ang

Constantinides

Cheung

et al. 2006

Reconfigurable Computing: Architectures and Applications


Abstract. Memory accesses contribute substantially to aggregate system delays. It is critical for designers to ensure that the memory subsystem is designed efficiently, and much work has been done on the exploitation of data re-use for algorithms that exhibit static memory access patterns in FPGAs. The proposed scheme enables the exploitation of data re-use for both static and non-static parallel memory access patterns through the use of a multi-port cache, where parameters can be determined at compile time and matched to the statistical properties of the application, and where sub-cache contentions are arbitrated with a semaphore-based system. A complete hardware implementation demonstrates that, for a motion vector estimation benchmark, the proposed caching scheme results in a cycle count reduction of 51% and execution time reduction of up to 24%, using a Xilinx XC2V6000 FPGA on a Celoxica RC300 board. Hardware resource usage and clock frequency penalties are analyzed while varying the number of ports and cache size. Consequently, it is demonstrated how the optimum cache size and number of ports may be established for a given datapath.
