We generate a family of FPGA stencil accelerators targeting emerging System-on-Chip platforms (e.g., Xilinx Zynq or Intel SoC). Our designs expose design knobs for exploring trade-offs between computing power, bandwidth requirements, and FPGA resource usage. We also propose performance models to home in on the most interesting design points, and show that they accurately lead to optimal designs. The optimal choice depends on problem sizes and performance goals.
I. INTRODUCTION

Iterative stencil computations arise in many application domains, ranging from medical imaging to numerical simulation. Since they are computationally demanding, a large body of work has addressed the problem of parallelizing and optimizing stencils for multi-cores, GPUs, and FPGAs.

Earlier attempts targeting FPGAs showed that the performance of such accelerators results from a complex interplay between the raw FPGA computing power, the amount of on-chip memory, and the performance of the external memory system [1]-[8]. They also illustrate how application requirements differ. For example, in the context of embedded vision, designers often seek the cheapest design that meets real-time performance constraints (e.g., 4K@60fps). In an exascale context, they may instead want to maximize performance (measured in ops-per-second) for a given FPGA board, while keeping power dissipation to a minimum.

Therefore, we explore a family of design options that can accommodate a large set of constraints, by exposing trade-offs between computing power, bandwidth requirements, and FPGA resource usage. We focus on system-level issues. Our aim is not to provide hand-optimized FPGA implementations. Instead, we have developed a code generator that produces HLS-optimized C/C++ descriptions of accelerator instances, leaving low-level decisions to the HLS back-end.

Our designs build upon the tiling transformation, which we use to balance on-chip memory cost and off-chip bandwidth. The design space we explore can be characterized by the following design knobs.