Robin Panda scite author profile

Spatially-tiled architectures, such as Coarse-Grained Reconfigurable Arrays (CGRAs), are powerful architectures for accelerating applications in the digital-signal processing, embedded, and scientific computing domains. In contrast to Field-Programmable Gate Arrays (FPGAs), another common accelerator, they typically time-multiplex their processing elements and are word-oriented rather than bit-oriented. These differences lead us to re-examine some of the traditional architecture choices made for FPGAs as we move to these coarser-granularity architectures. In this paper we study the efficiency of time-multiplexing global interconnect as architectures scale from single-bit to multi-bit datapaths. Using the Mosaic infrastructure, we analyzed the design trade-offs involved in static vs. time-multiplexed routing for global interconnect channels, as well as the benefit of including a dedicated bit-wide control interconnect to supplement the word-wide datapath of a CGRA. We show that a time-multiplexed interconnect is beneficial in these coarse-grained systems, reducing the area-energy product to 0.32× that of a fully static interconnect. We also show that for our benchmarks, which include single-bit control logic, providing both word-wide and bit-wide interconnect resources further reduces the area-energy product to 0.94× that of an exclusively word-wide interconnect.

Managing Short-Lived and Long-Lived Values in Coarse-Grained Reconfigurable Arrays

Essen, Wood, et al. 2010

Efficient storage in spatial processors is increasingly important as such devices get larger and support more concurrent operations. Unlike sequential processors that rely heavily on centralized storage, e.g. register files and embedded memories, spatial processors require many small storage structures to efficiently manage values that are distributed throughout the processor's fabric. The goal of this work is to determine the advantages and disadvantages of different architectural structures for storing values on-chip when optimizing for energy efficiency as well as area. Examination of applications for coarse-grained reconfigurable arrays (CGRAs) shows that most values are short-lived: they are produced and consumed quickly, but the distribution of value lifetimes has a reasonably long tail. We take advantage of this distribution to optimize register storage structures for managing short-, medium-, and long-lived values. We show that using a combination of register storage structures, each tailored for values with different lifetimes, reduces the overall area-energy product to 0.69× that of the baseline architecture, without loss of performance. Finally, we provide energy profiles, characteristics, and comparisons of each register structure to help architects guide future design choices.

Dynamic Communication in a Coarse Grained Reconfigurable Array

Hauck 2011

Coarse Grained Reconfigurable Arrays (CGRAs) are typically very efficient for a single task. However, all functional units are required to operate in lockstep, wasting resources and making complex programming flows difficult. Massively Parallel Processor Arrays (MPPAs) excel at executing unrelated tasks simultaneously, but limit the amount of resources dedicated to a single task. We propose an architecture with an MPPA's design flexibility and a CGRA's throughput, capable of processing and transferring data in a pre-compiled schedule, with dynamic transfers between components. Alternative interconnect strategies are compared for silicon area cost and power utilization.

Software Managed Distributed Memories in MPPAs

Hauck 2010

When utilizing reconfigurable hardware, many applications require more memory than is available in a single hardware block. While FPGAs have tools and mechanisms for building logically larger memories, doing so often requires developer intervention on word-oriented devices like Massively Parallel Processor Arrays (MPPAs). We examine building larger memories on the Ambric MPPA. Building an efficient structure requires low-level development and analysis of the latency and bandwidth effects of network and protocol choices. We build a network that requires only five instructions per transaction after optimization. The resource use and performance suggest architectural enhancements that should be considered for future devices.

Adding dataflow-driven execution control to a Coarse-Grained Reconfigurable Array

Ebeling, Hauck 2012