Microarchitectural Comparison of the MXP and Octavo Soft-Processor FPGA Overlays

LaForest, Charles Eric; Anderson, Jason

doi:10.1145/3053679

Cited by 9 publications

(7 citation statements)

References 26 publications

“…Most successful TM overlays are based on soft processors. The more performance oriented ones include, SIMD Octavo [13], VectorBlox MXP [24] and VLIW TILT [19]. A massively parallel overlay, called GRVI Phalanx [7], based on the RISC-V processor and the Hoplite NOC [11] mapped 1680 RISC-V cores onto an UltraScale+ VU9P.…”

Section: Related Work (mentioning)

confidence: 99%

“…NOPs (equal to IWP-1) must be added between dependant instructions (DFG nodes) unless other non-dependant instructions can be scheduled in between. For example, in the first (top) cluster, Node 17 is scheduled, followed by 13,25,9,20, and 12, before 15 is scheduled. Hence, the dependency between 17 and 15 is resolved and no NOPs are inserted.…”

Section: Compiling To the Overlay (mentioning)

confidence: 99%

A time-multiplexed FPGA overlay with linear interconnect

Jain, Maskell, et al. 2018

2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Coarse-grained overlays improve FPGA design productivity by providing fast compilation and software like programmability. Soft processor based overlays with well-defined ISAs are attractive to application developers due to their ease of use. However, these overlays have significant FPGA resource overheads. Time multiplexed (TM) CGRA-like overlays represent an interesting alternative as they are able to change their behavior on a cycle by cycle basis while the compute kernel executes. This reduces the FPGA resource needed, but at the cost of a higher initiation interval (II) and hence reduced throughput. The fully flexible routing network of current CGRA-like overlays results in high FPGA resource usage. However, many application kernels are acyclic and can be implemented using a much simpler linear feed-forward routing network. This paper examines a DSP block based TM overlay with linear interconnect where the overlay architecture takes account of the application kernels' characteristics and the underlying FPGA architecture, so as to minimize the II and the FPGA resource usage. We examine a number of architectural extensions to the DSP block based functional unit to improve the II, throughput and latency. The results show an average 70% reduction in II, with corresponding improvements in throughput and latency.
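
As an illustration of the NOP rule quoted in the second citation statement above, the following Python sketch pads a linear instruction order with NOPs so that each dependent node issues at least IWP slots after its producer. It is not the compiler described by Jain et al.: the function name `insert_nops`, the value IWP = 6, and the assumption that nodes 13, 25, 9, 20, and 12 are independent of 17 and 15 are all hypothetical choices made for this example. IWP is treated here as an opaque integer parameter.

```python
def insert_nops(order, deps, iwp):
    """Return the issue order with "NOP" entries inserted where needed.

    order: node ids in issue order (assumed to list producers before consumers)
    deps:  maps a node id to the ids of the nodes it depends on
    iwp:   assumed minimum slot distance between a producer and its consumer
    """
    schedule = []
    slot_of = {}  # node id -> slot index at which the node was issued
    for node in order:
        # Earliest slot at which every producer is at least `iwp` slots behind.
        earliest = max((slot_of[p] + iwp for p in deps.get(node, ())), default=0)
        # Pad with NOPs only if independent work has not already filled the gap.
        while len(schedule) < earliest:
            schedule.append("NOP")
        slot_of[node] = len(schedule)
        schedule.append(node)
    return schedule


if __name__ == "__main__":
    deps = {15: [17]}  # node 15 depends on node 17, as in the quoted example
    iwp = 6            # hypothetical value chosen for illustration
    print(insert_nops([17, 15], deps, iwp))
    print(insert_nops([17, 13, 25, 9, 20, 12, 15], deps, iwp))
```

Under these assumptions, scheduling 15 directly after 17 yields IWP-1 = 5 NOPs between them, while the order from the quoted example (17, then 13, 25, 9, 20, 12, then 15) already separates the two nodes by enough slots, so no NOPs are added.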

“…To improve power consumption and throughput, smaller and faster processor architectures, such as the iDEA processor [22], have been proposed. Examples of multi-threaded and parallel processors include: CUSTARD [23], Octavo [24] and SIMD-Octavo [25], The VectorBlox MXP soft vector processor [26] and the TILT VLIW processor [27].…”

Section: B Time-multiplexed Overlays (mentioning)

confidence: 99%

FPGA Overlays Hardware based Computing for the Masses

Phung, Maskell, Li 2018

Eighth International Conference on Advances in Computing, Electronics and Electrical Technology - CEET 2018

The hardware acceleration of compute intensive applications has definite advantages, particularly in terms of energy and application latency. Heterogeneous programmable system-on-chip (SoCs) FPGA devices, which combine general purpose processors with reconfigurable fabrics, provide a compelling platform for IoT applications. However, FPGA devices are constrained due to significant design productivity issues and a lack of suitable hardware abstraction. For FPGAs to compete as general purpose computing platforms they must be better virtualized, as eliminating the need to work with platform-specific details would make them more accessible to application developers who are accustomed to software API abstractions and fast development cycles. In this paper, we discuss the role of overlay architectures for enabling general purpose FPGA application acceleration.

“…FPGA overlay architectures [13], [14], [15], [16], [17] built around runtime programmable hardware blocks have emerged as one possible solution to this challenge, offering improved design productivity, by virtue of fast compilation, software-like programmability and run-time management, and high-level design abstraction. Runtime programmable hardware blocks may include (soft) processor arrays [18], [19], DMA engines, SIMD/VLIW engines [20], [21], programmable data-flow engines [22], [23], [24], [25], or Network-on-Chip (NoC) nodes [26].…”

Section: Introduction (mentioning)

confidence: 99%

Coarse Grained FPGA Overlay for Rapid Just-In-Time Accelerator Compilation

Jain, Maskell, Fahmy 2022

IEEE Trans. Parallel Distrib. Syst.

Coarse-grained FPGA overlays built around the runtime programmable DSP blocks in modern FPGAs can achieve high throughput and improved scalability compared to traditional overlays built without detailed consideration of FPGA architecture. These overlays can be mapped to using higher level compilers, achieving fast compilation, software-like programmability and run-time management, and high-level design abstraction. OpenCL allows programs running on a host computer to launch accelerator kernels which can be compiled at run-time for a specific architecture, thus enabling portability. However, prohibitive hardware compilation times in traditional design flows mean that the tools cannot effectively use just-in-time (JIT) compilation or runtime performance scaling on FPGAs. We present a methodology for runtime compilation of dataflow graphs expressed as OpenCL kernels onto coarse-grained overlays. The methodology benefits from the high level of abstraction afforded by using the OpenCL programming model, while the mapping to the overlay significantly reduces compilation and load times. Key characteristics of this work include highly performant DSP-optimized functional units that scale to large overlays on modern devices and the ability to perform automatic resource-aware kernel replication up to the size of the overlay. We demonstrate place and route times orders of magnitude better than traditional HLS flows, even when running on an embedded processor in the Xilinx Zynq.
