reMORPH: A Runtime Reconfigurable Architecture

Paul, Kolin; Dash, Chinmaya; Moghaddam, Mansureh S.

doi:10.1109/dsd.2012.111

Cited by 24 publications

(19 citation statements)

References 20 publications

“…An alternative solution is to build arrays of customized TM FUs and interconnect on the FPGA, similar to CGRAs [17]. A number of different interconnect styles for connecting between FUs can be used, with the most common being: island style [6], [8], nearest neighbor [20], [16] and to a lesser extent linear interconnect [3], [9]. The overhead of the interconnect network, particularly for island style and nearest neighbor interconnects, contribute to a significant FPGA resource utilization.…”

Section: Related Work (mentioning)

confidence: 99%

“…The reMORPH overlay [20] better targets the FPGA fabric, with an FU consuming 1 DSP Block, 3 block RAMs, 196 LUTs and 41 registers. To reduce overhead, the reMORPH FU does not use decoders resulting in a 72-bit instruction memory (supporting up to 512 instructions) which also over utilizes the BRAMs.…”

Section: Related Work (mentioning)

confidence: 99%
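
The instruction-memory figures in the statement above (72-bit words, up to 512 instructions) can be checked with a little arithmetic. The sketch below is illustrative only: the block RAM capacities (36 Kb and 18 Kb) are assumptions about the target device, not figures from the cited text, and the function names are hypothetical.

```python
import math

INSTR_WIDTH_BITS = 72  # instruction width, from the quoted statement
INSTR_DEPTH = 512      # maximum number of instructions, from the quoted statement


def instr_mem_bits(width_bits: int, depth: int) -> int:
    """Total storage needed for an instruction memory of the given shape."""
    return width_bits * depth


def brams_needed(total_bits: int, bram_bits: int) -> int:
    """Number of block RAM primitives needed, assuming capacity is the only limit."""
    return math.ceil(total_bits / bram_bits)


total = instr_mem_bits(INSTR_WIDTH_BITS, INSTR_DEPTH)

# Assumed primitive sizes (not from the source): 36 Kb and 18 Kb block RAMs.
print(total, brams_needed(total, 36 * 1024), brams_needed(total, 18 * 1024))
```

Under these assumptions the instruction memory alone fills one 36 Kb primitive (or two 18 Kb primitives), which is one way to read the remark about BRAM utilization. How the three block RAMs of the FU are actually partitioned is not stated in the quoted text.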

“…NOPs (equal to IWP-1) must be added between dependant instructions (DFG nodes) unless other non-dependant instructions can be scheduled in between. For example, in the first (top) cluster, Node 17 is scheduled, followed by 13,25,9,20, and 12, before 15 is scheduled. Hence, the dependency between 17 and 15 is resolved and no NOPs are inserted.…”

Section: Compiling To the Overlay (mentioning)

confidence: 99%
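
The NOP rule in the statement above can be sketched in a few lines. This is a minimal illustration, not the cited paper's compiler: it assumes the rule means a consumer must issue at least IWP slots after each of its producers within the same instruction stream, the function name `insert_nops` is hypothetical, and the value `iwp=4` is chosen only for the example. Producers that do not appear earlier in the same stream are ignored.

```python
from typing import Dict, List, Optional

NOP = None  # marker for an inserted no-op slot


def insert_nops(order: List[int],
                deps: Dict[int, List[int]],
                iwp: int) -> List[Optional[int]]:
    """Pad an instruction order so that each consumer issues at least
    `iwp` slots after every producer already placed in this stream.

    order: DFG node ids in their scheduled order.
    deps:  maps a node id to the node ids it depends on.
    """
    schedule: List[Optional[int]] = []
    slot_of: Dict[int, int] = {}
    for node in order:
        earliest = len(schedule)  # next free slot
        for producer in deps.get(node, []):
            if producer in slot_of:
                earliest = max(earliest, slot_of[producer] + iwp)
        # Fill the gap with NOPs only if independent work did not cover it.
        schedule.extend([NOP] * (earliest - len(schedule)))
        slot_of[node] = len(schedule)
        schedule.append(node)
    return schedule


# Node order from the quoted example: five independent nodes separate 17 and 15.
filled = insert_nops([17, 13, 25, 9, 20, 12, 15], {15: [17]}, iwp=4)
# Same dependency with nothing scheduled in between.
padded = insert_nops([17, 15], {15: [17]}, iwp=4)
print(filled)
print(padded)
```

With the assumed `iwp=4`, the first call inserts no NOPs because the five intervening nodes already cover the required gap, while the second call inserts IWP-1 = 3 NOPs between nodes 17 and 15.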

A time-multiplexed FPGA overlay with linear interconnect

Jain, Maskell, et al. 2018

2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Coarse-grained overlays improve FPGA design productivity by providing fast compilation and software like programmability. Soft processor based overlays with well-defined ISAs are attractive to application developers due to their ease of use. However, these overlays have significant FPGA resource overheads. Time multiplexed (TM) CGRA-like overlays represent an interesting alternative as they are able to change their behavior on a cycle by cycle basis while the compute kernel executes. This reduces the FPGA resource needed, but at the cost of a higher initiation interval (II) and hence reduced throughput. The fully flexible routing network of current CGRA-like overlays results in high FPGA resource usage. However, many application kernels are acyclic and can be implemented using a much simpler linear feed-forward routing network. This paper examines a DSP block based TM overlay with linear interconnect where the overlay architecture takes account of the application kernels' characteristics and the underlying FPGA architecture, so as to minimize the II and the FPGA resource usage. We examine a number of architectural extensions to the DSP block based functional unit to improve the II, throughput and latency. The results show an average 70% reduction in II, with corresponding improvements in throughput and latency.

“…Paul et al have reported that if the reconfiguration is limited to changing the connectivity at runtime, the overheads are typically very low [1]. This allows fairly large circuits to be implemented modularly in a time multiplexed manner which makes the implementation area and power efficient.…”

Section: D-fft Architecture (mentioning)

confidence: 99%

“…reMORPH [1] is a reconfigurable nearest neighbor mesh connected array of coarse grain reconfigurable tiles as illustrated in Figure 2. Modern FPGAs have hard DSP macros and lots of embedded memory which have been used to design the processing element (PE) to operate at 400 MHz with a very low footprint of 200 slice LUTs.…”

Section: D-fft Architecture (mentioning)

confidence: 99%

High performance 3D-FFT implementation

Nidhi, Paul, Hemani, et al. 2013

2013 IEEE International Symposium on Circuits and Systems (ISCAS2013)

Self Cite

3D FFT is a very data and compute intensive kernel encountered in many applications. We report a high performance design and implementation of 3D-FFT on a CGRA which supports partial reconfiguration. The hardware software multi clock design uses dynamic reconfiguration to reduce the required communication bandwidth to achieve a sustained throughput of 40 GOPS on a wordsize of 48 bits. Performance metrics including overheads and speed over software for implementations of up to 256 point 3D-FFT have been presented in the paper.
