Optimizing Remote Accesses for Offloaded Kernels: Application to High-Level Synthesis for FPGA

Alias, Christophe; Darte, Alain; Plesco, Alexandru

doi:10.7873/date.2013.127

Cited by 24 publications

(27 citation statements)

References 15 publications


“…Previous work tries to establish analytic optimization formulations for the combined problem, such as optimizations of loop tiling parameters and reuse buffer selections are formulated into quadratic programming [9] and geometric programming [31] respectively. Alias et al uses tiling and prefetching to reduce the memory traffic [7], focusing on the Altera tool-chain. They proposed a formulation for the prefetching problem and the pipelining of communications, but their approach does not consider the balance between communication volume and scratchpad size/energy, nor any design-space exploration, contrary to the present work.…”

Section: Related Work (mentioning)

confidence: 99%

Polyhedral-based data reuse optimization for configurable computing

Pouchet, Zhang, Sadayappan, et al. 2013

Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays


Many applications, such as medical imaging, generate intensive data traffic between the FPGA and off-chip memory. Significant improvements in the execution time can be achieved with effective utilization of on-chip (scratchpad) memories, associated with careful software-based data reuse and communication scheduling techniques. We present a fully automated C-to-FPGA framework to address this problem. Our framework effectively implements data reuse through aggressive loop transformation-based program restructuring. In addition, our proposed framework automatically implements critical optimizations for performance such as task-level parallelization, loop pipelining, and data prefetching. We leverage the power and expressiveness of the polyhedral compilation model to develop a multi-objective optimization system for off-chip communications management. Our technique can satisfy hardware resource constraints (scratchpad size) while aggressively exploiting data reuse. Our approach can also be used to reduce the on-chip buffer size subject to bandwidth constraint. We also implement a fast design space exploration technique for effective optimization of program performance using the Xilinx high-level synthesis tool.


“…Exploiting data overlap of successive tiles is introduced only very recently [7], here it is used after optimization to remove redundant transfers. In section VIII we compare to this strategy (inter-tile reuse) and show that it is important to include inter-tile reuse into the tile size selection process.…”

Section: Related Work (mentioning)

confidence: 99%

Inter-Tile Reuse Optimization Applied to Bandwidth Constrained Embedded Accelerators

Peemen, Mesman, Corporaal 2015

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015


Abstract: The adoption of High-Level Synthesis (HLS) tools has significantly reduced accelerator design time. A complex scaling problem that remains is the data transfer bottleneck. To scale up performance, accelerators require huge amounts of data and are often limited by interconnect resources. In addition, the energy spent by the accelerator is often dominated by the transfer of data, either in the form of memory references or data movement on the interconnect. In this paper we drastically reduce accelerator communication by exploring computation reordering and local buffer usage. Consequently, we present a new analytical methodology to optimize nested loops for inter-tile data reuse with loop transformations like interchange and tiling. We focus on embedded accelerators that can be used in a multi-accelerator System on Chip (SoC), so performance, area, and energy are key in this exploration. 1) On three common embedded applications in the image/video processing domain (demosaicing, block matching, object detection), we show that our methodology reduces data movement by up to 2.1x compared to the best case of intra-tile optimization. 2) We demonstrate that our small accelerators (1-3% of FPGA resources) can boost a simple MicroBlaze soft-core to the performance level of a high-end Intel i7 processor.
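The difference between intra-tile and inter-tile reuse mentioned in this abstract can be illustrated with a toy model. The sketch below is not the authors' methodology: it assumes a hypothetical 1D loop over `n` elements in which each output tile of `tile` elements reads an input window extended by `halo` elements on each side, and it simply counts how many input elements are loaded when every tile reloads its whole footprint versus when data still held from the previous tile is reused. All function and parameter names are illustrative.

```python
def tile_footprint(start, tile, halo, n):
    """Input indices needed to compute outputs [start, start + tile)."""
    lo = max(0, start - halo)
    hi = min(n, start + tile + halo)
    return set(range(lo, hi))


def loads_without_reuse(n, tile, halo):
    """Each tile reloads its full footprint (intra-tile reuse only)."""
    return sum(len(tile_footprint(s, tile, halo, n)) for s in range(0, n, tile))


def loads_with_inter_tile_reuse(n, tile, halo):
    """Each tile loads only what the previous tile did not already hold."""
    total = 0
    previous = set()
    for s in range(0, n, tile):
        current = tile_footprint(s, tile, halo, n)
        total += len(current - previous)  # overlap with previous tile is reused
        previous = current
    return total


if __name__ == "__main__":
    print(loads_without_reuse(16, 4, 1))          # 22
    print(loads_with_inter_tile_reuse(16, 4, 1))  # 16
```

In this toy setting the overlap between successive tiles is exactly the halo region, so keeping it on chip brings the number of loads down to one per input element. The model only tracks the immediately preceding tile and ignores buffer capacity, which a real tile-size selection would have to account for.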


“…(2) where T represents a tile, t < T represents the tiles scheduled for execution before the tile T, and t > T represents the tiles scheduled for execution after T. The denotation W(t > T) corresponds to $\bigcup_{t > T} W(t)$.…”

Section: Combining Load and Store Elimination (mentioning)

confidence: 99%
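The notation in the statement above, where W(t > T) stands for the write sets of all tiles scheduled after T taken together, can be sketched as follows. This is a minimal illustration under stated assumptions: tiles are given in execution order, each write set W(t) is modelled as a plain Python set of array indices, and the set combination is taken to be a union. The `final_stores` helper is a hypothetical use of the notation (keeping only elements that no later tile overwrites); it is not taken from the quoted text.

```python
def union_over(sets):
    """Union of an iterable of sets (empty set if there are none)."""
    result = set()
    for s in sets:
        result |= s
    return result


def written_before(writes, k):
    """W(t < T) for the k-th tile: everything written by earlier tiles."""
    return union_over(writes[:k])


def written_after(writes, k):
    """W(t > T) for the k-th tile: everything written by later tiles."""
    return union_over(writes[k + 1:])


def final_stores(writes, k):
    """Assumed use: elements of W(T) that no later tile overwrites."""
    return writes[k] - written_after(writes, k)


if __name__ == "__main__":
    # Hypothetical write sets of three tiles, listed in execution order.
    writes = [{0, 1, 2}, {2, 3}, {3, 4}]
    print(written_after(writes, 0))  # {2, 3, 4}
    print(final_stores(writes, 0))   # {0, 1}
```

Exact sets are used here for simplicity; the cited work operates on convex array regions, which may over-approximate the accessed elements.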

“…This results in the code shown in Figure 1.3, where isolated variables have been put in uppercase. Statements (3) and (5) correspond to the exact regions on scalar variables. Statements (2) and (4) We show how convex array regions are used to generate calls to these operators.…”

Section: Introducing Convex Array Regions (mentioning)

confidence: 99%


Beyond Do Loops: Data Transfer Generation with Convex Array Regions

Guelton, Amini, Creusillet 2013

Languages and Compilers for Parallel Computing


Abstract. Automatic data transfer generation is a critical step for guided or automatic code generation for accelerators using distributed memories. Although good results have been achieved for loop nests, more complex control flows such as switches or while loops are generally not handled. This paper shows how to leverage the convex array regions abstraction to generate data transfers. The scope of this study ranges from inter-procedural analysis in simple loop nests with function calls, to inter-iteration data reuse optimization and arbitrary control flow in loop bodies. Generated transfers are approximated when an exact solution cannot be found. Array regions are also used to extend redundant load store elimination to array variables. The approach has been successfully applied to GPUs and domain-specific hardware accelerators.
