Abstract: MPSoCs with hierarchical communication infrastructures are promising architectures for low-power embedded systems. Multiple CPU clusters are coupled using a Network-on-Chip (NoC). Our CoreVA-MPSoC targets streaming applications in embedded systems, like signal and video processing. In this work we introduce a tightly coupled shared data memory to each CPU cluster, which can be accessed by all CPUs of a cluster and by the NoC with low latency. The main focus is the comparison of different memory architectures and their connection to the NoC. We analyze memory architectures with local data memory only, shared data memory only, and a hybrid architecture integrating both. Implementation results are presented for a 28 nm FD-SOI standard cell technology. A CPU cluster with shared memory shows area requirements similar to the local memory architecture. We use post-place-and-route simulations for a precise analysis of energy consumption at both cluster and NoC level for the different memory architectures. An architecture with shared data memory shows the best performance results in combination with high resource efficiency. On average, the use of shared memory shows a 17.2% higher throughput for a benchmark suite of 10 applications compared to the use of local memory only.

Besides the communication infrastructure, the on-chip memory architecture also has a huge impact on performance and energy efficiency. The main focus of this paper is the comparison of different memory architectures and their interaction with the NoC for many-core systems. Compared to traditional processor systems, many many-core systems feature a different memory management, which changes the requirements on memory and NoC infrastructure. Traditional processor systems use a memory hierarchy with several (private and shared) on-chip caches, external DRAM, and a unified address space. This allows for easy programming, but results in unpredictable memory access times. Additionally, the cache logic and the coherence handling require a large amount of chip area and power. Therefore, many many-core systems omit data caches and use software-managed scratchpad memories instead, which provide a resource-efficient alternative [1]. For performance reasons, the scratchpad memories are tightly attached to each CPU, and communication between CPUs is initiated by software. In [2] we showed that the area and power consumption of a single CoreVA CPU's data memory increase by 10% when using a cache instead of a scratchpad memory. Due to cache coherence issues, it can be expected that these values will increase further for a cache-based many-core system. Additionally, software-managed scratchpad memories give full control of data communication to the programmer or an automatic partitioning tool (cf. Section III-E) and allow for a more accurate performance estimation.

The many-core architecture considered in this work is our CoreVA-MPSoC, which targets streaming applications in embedded and energy-limited systems. Examples of streaming applications are signal pr...
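To make the contrast between CPU-local and cluster-shared scratchpads concrete, the following sketch shows what software-managed communication between two CPUs of a cluster might look like in C. The addresses, block size, and flag-based handshake are purely illustrative assumptions and do not reflect the actual CoreVA-MPSoC memory map or communication API.

/* Hedged sketch: software-managed communication via a cluster-shared
 * scratchpad, as contrasted with CPU-local data memory in the text.
 * All addresses, sizes, and the flag-based handshake are hypothetical
 * illustrations, not the CoreVA-MPSoC's actual memory map or API. */
#include <stdint.h>
#include <string.h>

#define SHARED_MEM_BASE  0x40000000u   /* hypothetical shared-scratchpad base */
#define BLOCK_WORDS      64u

typedef struct {
    volatile uint32_t ready;            /* 0 = empty, 1 = data valid */
    uint32_t          data[BLOCK_WORDS];
} channel_t;

/* Producer CPU: copy a block from its local scratchpad into the shared
 * memory and raise the flag; no cache or coherence logic is involved,
 * because both memories are software-managed scratchpads. */
static void produce_block(const uint32_t *local_src)
{
    channel_t *ch = (channel_t *)SHARED_MEM_BASE;
    while (ch->ready) { /* wait until the consumer drained the last block */ }
    memcpy(ch->data, local_src, BLOCK_WORDS * sizeof(uint32_t));
    ch->ready = 1;
}

/* Consumer CPU on the same cluster: read directly from shared memory. */
static void consume_block(uint32_t *local_dst)
{
    channel_t *ch = (channel_t *)SHARED_MEM_BASE;
    while (!ch->ready) { /* spin until data is valid */ }
    memcpy(local_dst, ch->data, BLOCK_WORDS * sizeof(uint32_t));
    ch->ready = 0;
}

With local memory only, the same transfer would require the producer to push the block through the interconnect into the consumer's scratchpad; with the shared memory, both CPUs access the block at a single low-latency location, which is the effect the comparison above quantifies.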
Energy-efficient embedded computing enables new application scenarios in mobile devices like software-defined radio and video processing. The hierarchical multiprocessor considered in this work may contain dozens or hundreds of resource-efficient VLIW CPUs. Programming this number of CPU cores is a complex task requiring compiler support. The stream programming paradigm provides beneficial properties that help to support automatic partitioning. This work describes a compiler for streaming applications targeting the self-built hierarchical CoreVA-MPSoC multiprocessor platform. The compiler is supported by a programming model that is tailored to fit the stream programming paradigm. We present a novel simulated-annealing (SA) based partitioning algorithm, called Smart SA. The overall speedup of Smart SA is 12.84 for an MPSoC with 16 CPU cores compared to a single-CPU implementation. Comparison with a state-of-the-art partitioning algorithm shows an average performance improvement of 34.07%.

I. INTRODUCTION

The decreasing feature size of microelectronic circuits allows for the integration of more and more processing cores on a single chip. A Multiprocessor System-on-Chip (MPSoC) may consist of dozens of processing elements such as CPU cores or specialized hardware accelerators connected by a high-speed communication infrastructure, i.e. a Network-on-Chip (NoC). However, mapping general-purpose applications to a large number of MPSoC processing elements remains a non-trivial task. Manually writing low-level code for each core makes it difficult to experiment with different decompositions and mappings of computation to processors. Alternatively, higher-level programming frameworks allow the compiler to evaluate a larger design space when mapping the application to different hardware configurations. Efficient mapping algorithms are important for finding optimized solutions. The streaming paradigm provides regular and repeating computation and independent filters with explicit communication. This allows compilers to more easily exploit the task, data, and pipeline parallelism commonly found in signal processing, multimedia, network processing, cryptology, and similar application domains.

A popular stream-based programming language is StreamIt [1], [2]. The key principle of this language is to provide information about the inherent parallelism of the program by using a structured data flow graph. This graph consists of filters, pipelines, split-joins, and feedback loops.

In this paper we present a compiler for the StreamIt language targeting the self-built CoreVA-MPSoC architecture. The CoreVA-MPSoC is a highly scalable multiprocessor system based on a hierarchical communication infrastructure and the configurable VLIW processor CoreVA.

This paper is organized as follows: Section II describes our CoreVA-MPSoC hardware architecture. In Section III we discuss our StreamIt compiler with a focus on our novel simulated annealing partitioning algorithm (Smart SA). The communication model proposed in this work is presented in S...
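To illustrate the class of algorithm Smart SA belongs to, the following C sketch shows a generic simulated-annealing loop that maps filters to CPUs. The cost function (a simple makespan over per-filter workloads), the move operator, and the cooling schedule are placeholder assumptions; they are not the heuristics or cost model of the Smart SA algorithm itself.

/* Hedged sketch: a generic simulated-annealing loop for mapping stream
 * filters onto CPUs. The cost model, move operator, and cooling schedule
 * are simple placeholders, not the Smart SA heuristics of the paper. */
#include <stdlib.h>
#include <math.h>

#define NUM_FILTERS 32
#define NUM_CPUS    16

/* Placeholder cost: load of the most heavily loaded CPU; a real estimator
 * would also model NoC communication between filters on different CPUs. */
static double cost(const int map[NUM_FILTERS], const double work[NUM_FILTERS])
{
    double load[NUM_CPUS] = {0.0};
    double max_load = 0.0;
    for (int f = 0; f < NUM_FILTERS; f++)
        load[map[f]] += work[f];
    for (int c = 0; c < NUM_CPUS; c++)
        if (load[c] > max_load) max_load = load[c];
    return max_load;                    /* makespan-style objective */
}

static void anneal(int map[NUM_FILTERS], const double work[NUM_FILTERS])
{
    double temp = 1.0;
    double cur = cost(map, work);
    while (temp > 1e-3) {
        int f = rand() % NUM_FILTERS;   /* move one filter to a random CPU */
        int old_cpu = map[f];
        map[f] = rand() % NUM_CPUS;
        double next = cost(map, work);
        double delta = next - cur;
        if (delta <= 0.0 || exp(-delta / temp) > (double)rand() / RAND_MAX) {
            cur = next;                 /* accept better or, rarely, worse */
        } else {
            map[f] = old_cpu;           /* reject: undo the move */
        }
        temp *= 0.999;                  /* geometric cooling */
    }
}

The acceptance of occasional cost-increasing moves is what lets simulated annealing escape local minima; the quality of the result then depends mainly on the cost model and the move operator, which is where a "smart" variant differs from this generic skeleton.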
Parallel programming and effective partitioning of applications for embedded many-core architectures require optimization algorithms. However, these algorithms have to quickly evaluate thousands of different partitions. We present a fast performance estimator embedded in a parallelizing compiler for streaming applications. The estimator combines a single execution-based simulation with an analytic approach. Experimental results demonstrate that the estimator has a mean error of 2.6% and computes its estimation 2848 times faster than a cycle-accurate simulator.
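As a rough illustration of how a single measured profile can be combined with an analytic term, the following C sketch estimates the bottleneck CPU of a partition from per-filter cycle counts (measured once) plus a simple linear communication cost. The linear model and its constants are assumptions for illustration, not the estimator's actual model.

/* Hedged sketch: one-time measured per-filter cycle counts combined with
 * an assumed linear communication cost to rank candidate partitions. */
#define N_FILTERS 32
#define N_CPUS    16

/* cycles[f]: per-iteration cycles of filter f, measured once in a
 * cycle-accurate simulation; bytes[f]: data sent to its successor next[f]
 * (next[f] < 0 marks the last filter of the pipeline). */
static unsigned long estimate_cycles(const int map[N_FILTERS],
                                     const int next[N_FILTERS],
                                     const unsigned long cycles[N_FILTERS],
                                     const unsigned long bytes[N_FILTERS])
{
    unsigned long load[N_CPUS] = {0};
    for (int f = 0; f < N_FILTERS; f++) {
        load[map[f]] += cycles[f];
        /* analytic part: assumed fixed setup cost plus a per-byte cost
         * whenever producer and consumer run on different CPUs */
        if (next[f] >= 0 && map[next[f]] != map[f])
            load[map[f]] += 50 + bytes[f] / 4;
    }
    unsigned long bottleneck = 0;
    for (int c = 0; c < N_CPUS; c++)
        if (load[c] > bottleneck) bottleneck = load[c];
    return bottleneck;   /* the slowest CPU bounds pipeline throughput */
}

Because the cycle-accurate simulation is run only once per filter rather than once per candidate partition, each subsequent estimate reduces to a few arithmetic operations, which is where the large speedup over full simulation comes from.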
Network Interfaces (NIs) are used in Multiprocessor Systems-on-Chip (MPSoCs) to connect CPUs to a packet-switched Network-on-Chip. In this work we introduce a new NI architecture for our hierarchical CoreVA-MPSoC. The CoreVA-MPSoC targets streaming applications in embedded systems. The main contribution of this paper is a system-level analysis of different NI configurations, considering both software and hardware costs for NoC communication. Different configurations of the NI are compared using a benchmark suite of 10 streaming applications. The best-performing NI configuration shows an average speedup of 20 for a CoreVA-MPSoC with 32 CPUs compared to a single CPU. Furthermore, we present physical implementation results using a 28 nm FD-SOI standard cell technology. A hierarchical MPSoC with 8 CPU clusters and 4 CPUs in each cluster running at 800 MHz requires an area of 4.56 mm².
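For illustration, the sketch below shows what the software side of a transfer through such an NI might look like: a few writes to memory-mapped registers followed by a busy wait. The register layout, base address, and field encodings are invented for this example and are not the CoreVA-MPSoC NI's real interface; they only indicate where the software costs mentioned above arise.

/* Hedged sketch: hypothetical memory-mapped NI registers and a send
 * routine. Layout and addresses are illustrative assumptions only. */
#include <stdint.h>

typedef struct {
    volatile uint32_t dest;     /* target cluster / CPU (hypothetical encoding) */
    volatile uint32_t addr;     /* destination address in the remote memory */
    volatile uint32_t size;     /* payload size in 32-bit words */
    volatile uint32_t start;    /* write 1 to trigger packetization */
    volatile uint32_t busy;     /* 1 while the NI is sending */
} ni_regs_t;

#define NI_BASE ((ni_regs_t *)0x50000000u)   /* hypothetical NI register base */

/* Software cost of a send: a handful of register writes plus waiting for
 * the NI to drain; how much of this work the NI offloads from the CPU is
 * exactly what a system-level analysis of NI configurations has to weigh
 * against the NI's hardware cost. */
static void ni_send(uint32_t dest, uint32_t remote_addr, uint32_t words)
{
    while (NI_BASE->busy) { /* previous transfer still in flight */ }
    NI_BASE->dest  = dest;
    NI_BASE->addr  = remote_addr;
    NI_BASE->size  = words;
    NI_BASE->start = 1;
}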