Programmable hardware built on a regular architecture can partially alleviate the problem of increased defect densities associated with transistor scaling by dynamically wiring around the defects [1]. The fine granularity of FPGAs is, however, unsuitable for effectively exploiting runtime reconfiguration because of the high overheads involved. A coarse-grain reconfigurable array with malleable communication links, reMORPH, is proposed in this paper. The compute tile uses DSP48E and BRAM embedded blocks in a Xilinx FPGA and has a very low footprint of about 200 slice LUTs. The semi-systolic near-neighbour communication interconnect can be dynamically reconfigured for each "epoch" of computation. The epochs, or phases, of the application are obtained via profiling or static data flow analysis. Only some of the links between the compute tiles are changed during the reconfiguration phase, which drastically reduces the context-switch overhead and enables applications with high performance per area to be built on this fabric.
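The idea that only some links change between epochs can be illustrated with a minimal sketch. It assumes, purely for illustration, that each epoch's interconnect is modelled as a set of directed tile-to-tile links; the tile names and the helper `links_to_reconfigure` are hypothetical and are not taken from the reMORPH design.

```python
def links_to_reconfigure(current, nxt):
    """Return (links to tear down, links to set up) when moving between epochs.

    Each epoch is modelled as a set of (source_tile, destination_tile) pairs.
    Links present in both epochs are left untouched.
    """
    current, nxt = set(current), set(nxt)
    return current - nxt, nxt - current


# Hypothetical interconnects for two consecutive epochs.
epoch_a = {("t0", "t1"), ("t1", "t2"), ("t2", "t3")}
epoch_b = {("t0", "t1"), ("t1", "t3"), ("t2", "t3")}

removed, added = links_to_reconfigure(epoch_a, epoch_b)
print("tear down:", sorted(removed))
print("set up:   ", sorted(added))
```

In this toy case two of the three links are shared between the epochs, so only one link is torn down and one is set up, which is the kind of partial change the abstract credits for the low context-switch overhead.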
Modulo-scheduled coarse-grain reconfigurable array (CGRA) processors excel at exploiting loop-level parallelism at a high performance-per-watt ratio. The frequent reconfiguration of the array, however, causes between 25% and 45% of the consumed chip energy to be spent on the instruction memory and fetches therefrom. This article presents a hardware/software codesign methodology for such architectures that reduces both the size required to store the modulo-scheduled loops and the energy consumed by the instruction decode logic. The hardware modifications improve the spatial organization of a CGRA's execution plan by reorganizing the configuration memory into separate partitions based on a statistical analysis of code. A compiler technique optimizes the generated code in the temporal dimension by minimizing the number of signal changes. On average, the optimizations reduce code size by more than 63% and the energy consumed by the instruction decode logic by 70% across a wide variety of application domains. Decompression of the compressed loops can be performed in hardware with no additional latency, rendering the presented method ideal for low-power CGRAs running at high frequencies. The presented technique is orthogonal to dictionary-based compression schemes and can be combined with them to achieve a further reduction in code size. CCS Concepts: • Computer systems organization → Reconfigurable computing; • Hardware → Power estimation and optimization; • Software and its engineering → Compilers;
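To make "minimizing the number of signal changes" concrete, the sketch below counts bit flips between consecutive configuration words and compares two ways of filling cycles in which a field is unused. This is an assumption-laden illustration, not the article's compiler technique: the 4-bit field, the use of `None` for an unused cycle, and the helpers `count_toggles` and `hold_previous` are all hypothetical.

```python
def count_toggles(words):
    """Count bit flips between consecutive configuration words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def hold_previous(fields, initial=0):
    """Fill unused cycles (None) with the last used value so the signal holds."""
    out, prev = [], initial
    for field in fields:
        prev = prev if field is None else field
        out.append(prev)
    return out


# Hypothetical per-cycle values of one 4-bit configuration field;
# None marks a cycle in which the field is not used.
schedule = [0b1010, None, 0b1010, None, 0b0110]

zero_filled = [0 if f is None else f for f in schedule]
held = hold_previous(schedule)

print("toggles with zero fill:    ", count_toggles(zero_filled))
print("toggles with held values:  ", count_toggles(held))
```

Holding the previous value in unused cycles leaves the field unchanged until a new value is actually needed, so fewer bits flip from one cycle to the next than when unused cycles are reset to zero.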