Low-cost guaranteed-throughput dual-ring communication infrastructure for heterogeneous MPSoCs

Dekens, Berend H.J.; Wilmanns, Philip S.; Smit, Gerard J. M.; Bekooij, Marco

doi:10.1109/dasip.2014.7115628

Cited by 5 publications

(12 citation statements)

References 17 publications

“…The BCETs and WCETs of the tasks, which are denoted next to the corresponding dataflow actors, represent theoretical bounds on actual execution times on the Starburst platform [4]. The bounds adhere to the requirements given in Section IV-A, i.e.…”

Section: B Cyclic Applications (mentioning)

confidence: 99%

Combining Offsets with Precedence Constraints to Improve Temporal Analysis of Cyclic Real-Time Streaming Applications

Kurtin

Hausmans

Bekooij

2016

2016 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS)

Stream processing applications executed on multiprocessor systems usually contain cyclic data dependencies due to the presence of bounded FIFO buffers and feedback loops, as well as cyclic resource dependencies due to the usage of shared processors. In recent works it has been shown that temporal analysis of such applications can be performed by iterative fixed-point algorithms that combine dataflow and response time analysis techniques. However, these algorithms consider resource dependencies based on the assumption that tasks on shared processors are enabled simultaneously, resulting in a significant overestimation of interference between such tasks. This paper extends these approaches by integrating an explicit consideration of precedence constraints with a notion of offsets between tasks on shared processors, leading to a significant improvement of temporal analysis results for cyclic stream processing applications. Moreover, the addition of an iterative buffer sizing enables an improvement of temporal analysis results for acyclic applications as well.

The performance of the presented approach is evaluated in a case study using a WLAN transceiver application. It is shown that 56% higher throughput guarantees and 52% smaller end-to-end latencies can be determined compared to state-of-the-art.

I. INTRODUCTION

Real-time stream processing applications such as Software Defined Radios (SDRs) that are executed on multiprocessor systems usually require to give temporal guarantees at design time, ensuring that throughput and latency constraints can be always satisfied. In many cases, however, a temporal analysis to obtain such guarantees is not trivial, as both cyclic data dependencies and processor sharing with run-time schedulers heavily influence the temporal behavior of an analyzed application. Besides, so-called cyclic resource dependencies occur wherever resource dependencies introduced by processor sharing are opposite to the flow of data, making temporal analysis even more challenging.

It has been shown that dataflow analysis techniques are capable of obtaining temporal guarantees under such demanding circumstances [17], [19], [23]. The applicability of dataflow analysis techniques is not limited to temporal analysis, but also includes the computation of required buffer capacities [24], scheduler settings [22], a suitable task-to-processor assignment [18] and forms the basis for synchronization overhead minimization techniques such as task clustering [5] and resynchronization [8].

Especially the inherent support of cyclic data dependencies, which is enabled by the so-called the-earlier-the-better refinement relation [7], distinguishes dataflow analysis techniques from other approaches. Cyclic data dependencies regularly occur due to the presence of feedback loops. Moreover, cyclic data dependencies are also introduced by the usage of First-In-First-Out (FIFO) buffers with blocking writes for inter-task communication, i.e. buffers on which a writing task is suspended if the buffer is full. Data dependencies become cy...

“…Our analysis requires BCETs and WCETs of all tasks that hold independent of schedules. The underlying hardware must support obtaining these times, as does for instance the Starburst architecture [4]. Thereby it can be assumed that all tasks are executed in isolation, since processor sharing is analyzed by our algorithm.…”

Section: A Analysis Model (mentioning)

confidence: 99%

Combining Offsets with Precedence Constraints to Improve Temporal Analysis of Cyclic Real-Time Streaming Applications

Kurtin

Hausmans

Bekooij

2016

2016 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS)

“…Multiplexing of data streams is supported where state is retained within the accelerators, limiting the number of multiplexed streams per accelerator. Eclipse [10] employs a shell to integrate stream processing accelerators, an approach similar to the network [11] used in our design. The use of the C-FIFO [12] algorithm in a hardware shell results in larger hardware costs than our credit-based hardware flow-control support in our network.…”

Section: Related Work (mentioning)

confidence: 99%

“…Our approach of using a network to stream data between accelerators and a dedicated bus to save and restore accelerator state is similar to [13]. However, the use of a switch to implement point-to-point connections results in higher hardware costs compared to the ring-based interconnect of [14], [11] which is used in this paper. The interconnect from [13] provides real-time guarantees but lacks support to share accelerators.…”

Section: Related Work (mentioning)

confidence: 99%

Real-Time Multiprocessor Architecture for Sharing Stream Processing Accelerators

Dekens

Bekooij

Smit

2015

2015 IEEE International Parallel and Distributed Processing Symposium Workshop

Self Cite

Stream processing accelerators are often applied in MPSoCs for software defined radios. Sharing of these accelerators between different streams could improve their utilization and thereby reduce the hardware cost, but is challenging under real-time constraints.

In this paper we introduce entry- and exit-gateways that are responsible for multiplexing blocks of data over accelerators under real-time constraints. These gateways check for the availability of sufficient data and space and thereby enable the derivation of a dataflow model of the application. The dataflow model is used to verify the worst-case temporal behavior based on the sizes of the blocks of data used for multiplexing. We demonstrate that required buffer capacities are non-monotone in the block size. Therefore, an ILP is presented to compute minimum block sizes and sufficient buffer capacities.

The benefits of sharing accelerators are demonstrated using a multi-core system that is implemented on a Virtex 6 FPGA. A stereo audio stream from a PAL video signal is demodulated in this system in real-time where two accelerators are shared within and between two streams. In this system sharing reduces the number of accelerators by 75% and the number of logic cells by 63%.

“…Therefore, SDF can be used to analyse if an application, which is modelled as an SDF graph, meets all its Quality of Service (QoS) requirements [92]. SDF can, for example, be used to model the latency and rate characteristics of data streams over a predictable interconnect like a ring network [29].…”

Section: Synchronous Dataflow (mentioning)

confidence: 99%

A fine-grained parallel dataflow-inspired architecture for streaming applications

Niedermeier

This thesis was typeset using LaTeX, TikZ, and GNU Emacs. This thesis was printed by Gildeprint, The Netherlands.

Abstract

Data-driven streaming applications are quite common in modern multimedia and wireless applications, like for example video and audio processing. The main components of these applications are Digital Signal Processing (DSP) algorithms.

These algorithms are not extremely complex in terms of their structure and the operations that make up the algorithms are fairly simple (usually binary mathematical operations like addition and multiplication). What makes it challenging to implement and execute these algorithms efficiently is their large degree of fine-grained parallelism and the required throughput.

DSP algorithms can usually be described as dataflow graphs with nodes corresponding to operations and edges between the nodes expressing data dependencies. On the edges, data travels in the form of tokens. A node fires as soon as all required input data has arrived at its input edge(s). One firing consists of consuming the input data (i.e. input tokens), executing the desired operation, and producing the output data (i.e. output tokens). Usually, input data to the dataflow graph is provided as a stream of tokens. As a consequence, a well-behaved dataflow graph keeps executing as long as input data arrives.

To execute DSP algorithms efficiently while maintaining flexibility, coarse-grained reconfigurable arrays (CGRAs) can be used. CGRAs are composed of a set of small, reconfigurable cores, interconnected in e.g. a two dimensional array. Each core by itself is not very powerful, yet the complete array of cores forms an efficient architecture with a high throughput due to its ability to efficiently execute operations in parallel.

To program CGRAs, usually an architecture-specific subset of C is defined which is then used to specify and implement algorithms on the respective CGRA. However, the C programming paradigm was not developed to specify algorithms that contain a large degree of fine-grained parallelism. Instead, it was designed to implement sequential algorithms on single-core architectures.

In this thesis, we present a CGRA targeted at data-driven streaming DSP applications that contain a large degree of fine-grained parallelism, such as matrix manipulations or filter algorithms. Along with the architecture, also a programming language is presented that can directly describe DSP applications as dataflow graphs which are then automatically mapped and executed on the architecture.

In contrast to previously published work on CGRAs, the guiding principle and inspiration for the presented CGRA and its corresponding programming paradigm is the dataflow principle. Three main aspects can be named here:

1. A DSP algorithm is represented as a dataflow graph with nodes corresponding to operations and edges between the nodes corresponding to data dependencies.

2. The configuration and execution principles of the cores in the architecture are based on dataflow principles, i.e. a core starts its executi...
