Anja Niedermeier scite author profile

This thesis was typeset using LaTeX, TikZ, and GNU Emacs. This thesis was printed by Gildeprint, The Netherlands.
Abstract
Data-driven streaming applications are common in modern multimedia and wireless applications, such as video and audio processing. The main components of these applications are Digital Signal Processing (DSP) algorithms.
These algorithms are not structurally complex, and the operations that make them up are fairly simple (usually binary mathematical operations such as addition and multiplication). What makes it challenging to implement and execute these algorithms efficiently is their large degree of fine-grained parallelism and the required throughput.
DSP algorithms can usually be described as dataflow graphs, with nodes corresponding to operations and edges between the nodes expressing data dependencies. On the edges, data travels in the form of tokens. A node fires as soon as all required input data has arrived at its input edge(s). One firing consists of consuming the input data (i.e. input tokens), executing the desired operation, and producing the output data (i.e. output tokens). Usually, input data to the dataflow graph is provided as a stream of tokens. As a consequence, a well-behaved dataflow graph keeps executing as long as input data arrives.
To execute DSP algorithms efficiently while maintaining flexibility, coarse-grained reconfigurable arrays (CGRAs) can be used. CGRAs are composed of a set of small, reconfigurable cores, interconnected in, for example, a two-dimensional array. Each core by itself is not very powerful, yet the complete array of cores forms an efficient, high-throughput architecture because it can execute many operations in parallel.
To program CGRAs, usually an architecture-specific subset of C is defined, which is then used to specify and implement algorithms on the respective CGRA. However, the C programming paradigm was not developed to specify algorithms that contain a large degree of fine-grained parallelism. Instead, it was designed to implement sequential algorithms on single-core architectures.
In this thesis, we present a CGRA targeted at data-driven streaming DSP applications that contain a large degree of fine-grained parallelism, such as matrix manipulations or filter algorithms. Along with the architecture, a programming language is presented that can directly describe DSP applications as dataflow graphs, which are then automatically mapped and executed on the architecture.
In contrast to previously published work on CGRAs, the guiding principle and inspiration for the presented CGRA and its corresponding programming paradigm is the dataflow principle. Three main aspects can be named here:
1. A DSP algorithm is represented as a dataflow graph with nodes corresponding to operations and edges between the nodes corresponding to data dependencies.
2. The configuration and execution principles of the cores in the architecture are based on dataflow principles, i.e. a core starts its executi...
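The firing rule described in the abstract above (a node fires once every input edge holds a token, consumes those tokens, and produces an output token) can be illustrated with a minimal Python sketch. This is not the thesis's language or architecture; the `Node` class, the `run` scheduler, and the example graph `y = (a * b) + c` are hypothetical names chosen only for illustration.

```python
from collections import deque
import operator


class Node:
    """A dataflow node that fires once every input edge holds a token."""

    def __init__(self, op, n_inputs):
        self.op = op
        self.inputs = [deque() for _ in range(n_inputs)]  # one FIFO per input edge
        self.consumers = []  # (node, port) pairs fed by this node's output edge
        self.results = []    # output tokens, kept when no consumer is attached

    def connect(self, node, port):
        self.consumers.append((node, port))

    def can_fire(self):
        # An empty deque is falsy, so this checks that every edge has a token.
        return all(self.inputs)

    def fire(self):
        args = [edge.popleft() for edge in self.inputs]  # consume input tokens
        token = self.op(*args)                           # execute the operation
        if not self.consumers:
            self.results.append(token)
        for node, port in self.consumers:                # produce output token
            node.inputs[port].append(token)


def run(nodes):
    """Keep firing enabled nodes until no node can fire any more."""
    fired = True
    while fired:
        fired = False
        for node in nodes:
            if node.can_fire():
                node.fire()
                fired = True


# Hypothetical example graph: y = (a * b) + c, fed with a stream of two token sets.
mul = Node(operator.mul, 2)
add = Node(operator.add, 2)
mul.connect(add, 0)

for a, b, c in [(1, 2, 3), (4, 5, 6)]:
    mul.inputs[0].append(a)
    mul.inputs[1].append(b)
    add.inputs[1].append(c)

run([mul, add])
print(add.results)  # [5, 26]
```

The graph keeps executing for as long as tokens are available and stops on its own when the input stream is exhausted, which mirrors the "well-behaved" behaviour described in the abstract.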

Dataflow-based reconfigurable architecture for streaming applications

Kuper

Smit

2012

Coarse-grain reconfigurable arrays often rely on an imperative programming approach including a read/write mechanism for memory access. In this paper, we present an architecture composed of a configurable array of computing cores and memory blocks in which both the execution mechanism and configuration principle of the computing cores and the behaviour of the memory blocks are based on streaming and dataflow principles. We illustrate our ideas with the implementation of a long finite impulse response (FIR) filter where memory tiles are used to store intermediate results.
I. MOTIVATION AND RELATED WORK
Streaming applications are common in modern multimedia and wireless applications, such as video and audio processing. In streaming applications, efficiency can be increased drastically if the underlying execution mechanism is based on dataflow principles, i.e. the system starts the execution as soon as the required input data is available, in contrast to the conventional load/store mechanism commonly found in imperative approaches.
In embedded computing, coarse-grain reconfigurable architectures are an emerging paradigm for efficient implementations of streaming systems. Those architectures usually combine a general purpose processor (GPP) as host controller with an array of small, reconfigurable processing elements that are interconnected to form a larger, reconfigurable multicore architecture. Cores in those arrays are usually small and contain only the ALU and some local storage. Often a larger external memory is added, for example to store intermediate results or to provide look-up tables.
A number of papers on coarse-grain reconfigurable architectures have already been published; an extended overview of reconfigurable architectures can be found in [1] and [2]. MorphoSys [3] is a hybrid of a host CPU and a reconfigurable array. The connection to external memory is provided via the system bus with DMA. The programming principle is based on imperative programming. ADRES [4] is a combination of a very long instruction word (VLIW) processor with a tight connection to a reconfigurable grid. The two parts are connected via a multi-port register file. Data access is performed via load/store operations. The programming is C-based. XPP [5] contains an array of 8x8 processing elements including 2 RAM blocks per row. The XPP array can be programmed either with the low-level NML (native mapping language) or with an XPP-specific subset of C. Memory access is performed with read and write operations. DREAM [6] consists of a control unit, data path and a memory access unit. To transfer data between DREAM and the host CPU, exchange buffers are available. DREAM is programmed using macro-instructions that are described in single-assignment C syntax. RICA [7] is a heterogeneous array of reconfigurable ICs. Dedicated control ICs are available so that RICA does not require a host control CPU. In the array, distributed memory elements are available that are accessed via special memory access ICs. The programming princip...
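As a rough illustration of the FIR filter mentioned in the abstract above, the following Python sketch computes y[n] = Σ_k h[k]·x[n−k] over a stream of samples. It is a generic software model, not the paper's mapping onto cores and memory tiles; the function name `fir_stream` and the delay line (which only loosely stands in for stored intermediate data) are assumptions made for this example.

```python
from collections import deque


def fir_stream(samples, coeffs):
    """Yield y[n] = sum_k coeffs[k] * x[n-k] for each incoming sample."""
    # Delay line holding the most recent len(coeffs) samples, newest first.
    delay_line = deque([0] * len(coeffs), maxlen=len(coeffs))
    for x in samples:
        delay_line.appendleft(x)  # the oldest sample drops off the right end
        yield sum(h * v for h, v in zip(coeffs, delay_line))


print(list(fir_stream([1, 2, 3, 4], [1, 1, 1])))  # [1, 3, 6, 9]
```

Because the filter is written as a generator, one output token is produced per input token, in the same streaming style the abstract describes.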

Designing a dataflow processor using CλaSH

Wester

Rovers

et al. 2010

Abstract: In this paper we show how a simple dataflow processor can be fully implemented using CλaSH, a high-level HDL based on the functional programming language Haskell. The processor was described in Haskell, and the CλaSH compiler was then used to translate the design into fully synthesisable VHDL code. The VHDL code was synthesised with 90 nm TSMC libraries and placed and routed. Simulation of the final netlist showed correct behaviour. We conclude that Haskell and CλaSH are well-suited to define hardware on a very high level of abstraction which is close to the mathematical description of the desired architecture. By using CλaSH, the designer does not have to deal with internal implementation details, as is the case when designing with VHDL. The complete processor was described in 300 lines of code; some snippets are shown as illustration.

A Dataflow Inspired Programming Paradigm for Coarse-Grained Reconfigurable Arrays

Kuper

Smit

2014

In this paper, we present a new approach towards programming coarse-grained reconfigurable arrays (CGRAs) in an intuitive, dataflow-inspired way. Based on the observation that available CGRAs are usually programmed using C, which lacks proper support for instruction-level parallelism, we instead started from a dataflow perspective combined with a language that inherently supports parallel structures. Our programming paradigm decouples the local functionality of a core from the global flow of data, i.e. the kernels from the routing. We describe the ideas of our programming paradigm as well as the language and compiler itself. Our complete system, including the CGRA, the programming language and the compiler, was developed using Haskell, which leads to a complete, sound system. We finish the paper with the implementation of a number of algorithms using our system.
Motivation and Related Work
Many algorithms common in digital signal processing (DSP), such as audio filtering, contain a high degree of instruction-level parallelism. To accelerate those algorithms, coarse-grained reconfigurable arrays (CGRAs) are often used due to their capability of large-scale parallelism. A CGRA is an array of small, configurable cores, often in combination with a general purpose processor for control operations. The cores in the CGRA usually contain an ALU and a small local memory.
Popular examples of CGRAs are MorphoSys (2000) [1], XPP (2003) [2], ADRES (2003) [3] and SmartCell (2010) [4]. Since the details of the mentioned CGRAs are beyond the scope of this paper, the reader is referred to the respective papers and to the surveys [5] and [6], where a good overview of CGRAs is given.
This research is conducted as part of the Sensor Technology Applied in Reconfigurable systems for sustainable Security (STARS) project (www.starsproject.nl).

A dataflow-inspired CGRA for streaming applications

Kuper

Smit

2013