FLECSim-SoC: A Flexible End-to-End Co-Design Simulation Framework for System on Chips

Hotfilter, Tim; Hoefer, Julian; Kres, Fabian; Kempf, Fabian; Becker, Juergen

doi:10.1109/socc52499.2021.9739212

Cited by 6 publications

(7 citation statements)

References 19 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Abstract simulators solve the lack of scalability in FPGA and the slow RTL simulation by lowering its simulation accuracy. Due to its faster simulation, abstract platforms can execute an OS [19,20,22,24] or emulate the OS behavior in high-level while adding delays similar to the real OS behavior in the target system [18]. Only two proposals in this category support peripherals (MPSoCSim and Mack et al).…”

Section: Discussionmentioning

confidence: 99%

Chronos-V: A Many-Core High-level Model with Support for Management Techniques

Weber

Zotto

Moraees

2022

Preprint

View full text Add to dashboard Cite

This work presents Chronos-V, a Many-Core System-on-Chip (MCSoC) that adopts abstract hardware modeling, executing the FreeRTOS Operating System (OS) at each processing element (PE). Chronos-V is a heterogeneous architecture with two regions: (i) General Purpose Processing Elements (GPPE), responsible for executing user applications; (ii) peripherals that provide IO capabilities or hardware acceleration to the system. Besides the standard goal of high-level models, design space exploration at early design stages with reduced simulation time, our goal is to advance the state-of-the-art in the MCSoC research field by proposing an architecture with hardware and software support for management techniques. As a case study, we present an ODA (Observe-Decide-Actuate) loop for thermal management, comparing it to a dark silicon patterning mapping in a platform with 196 PEs. Thermal maps show the benefits of using dynamic thermal management in terms of hotspot avoidance and temperature reduction.

show abstract

Section: Discussionmentioning

confidence: 99%

Chronos-V: A Many-Core High-level Model with Support for Management Techniques

Weber

Zotto

Moraees

2022

Preprint

View full text Add to dashboard Cite

show abstract

“…where X ∈ R C is the input vector of neurons, W ∈ R K ×C is the weight matrix, b ∈ R K is the bias array, Y ∈ R K is the output array, C and K are the number of input and output activations processed by the FC layer, respectively. By applying (4) to each of the four real variables in (5), setting their own quantized ranges a priori, and moving the quantized output array Y q,k to the left hand side, we obtain the quantized FC expression valid for the k-th output activation:…”

Section: ) Integer-only Dnn Kernelsmentioning

confidence: 99%

“…In applications that require utmost accuracy, a common choice is to use 16 bits to quantize activations and weights. Some examples are safety-critical applications, such as image segmentation in foggy environments for autonomous driving [5]; others are image processing applications that work with highresolution satellite images, or high dynamic range (HDR) images and super-resolution [27]. 8 bits is the default precision to quantize DNNs while avoiding performance degradation [27] and is therefore the most commonly used.…”

Section: Sum-together Multipliersmentioning

confidence: 99%

“…Recently, there has been a growing interest in academia and industry towards Mixed-Precision Quantization (MPQ) [2]. This technique leverages the different sensitivity to quantization of each DNN layer [3], [4] to search for the optimal number of activation and weight bits for each individual layer, enabling accuracy vs latency and accuracy vs energy trade-offs [5].…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

High-Level Design of Precision-Scalable DNN Accelerators Based on Sum-Together Multipliers

Urbinati,

Casu

2024

IEEE Access

View full text Add to dashboard Cite

Precison-scalable (PS) multipliers are gaining traction in Deep Neural Network accelerators, particularly for enabling mixed-precision (MP) quantization in Deep Learning at the edge. This paper focuses on the Sum-Together (ST) class of PS multipliers, which are subword-parallel multipliers that can execute a standard multiplication at full precision or a dot-product with parallel low-precision operands. Our contributions in this area encompass multiple aspects: we enrich our previous comparison of SoA ST multipliers by including our recent radix-4 Booth ST multiplier and two novel designs; we extend the explanation of the architecture and the design flow of our previously proposed ST-based PS hardware accelerators designed for 2D-Convolution, Depth-wise Convolution, and Fully-Connected layers that we developed using High-Level Synthesis (HLS); we implement the uniform integer quantization equations in hardware; we conduct a broad HLS-driven design space exploration of our ST-based accelerators, varying numerous hardware parameters; finally, we showcase the advantages of ST-based accelerators when integrated into System-on-Chips (SoCs) in three different scenarios (low-area, low-power, and low-latency), running inference on MP-quantized MLPerf Tiny models as case study. Across the three scenarios, the results show an average latency speedup of 1.46x, 1.33x, and 1.29x, a reduced energy consumption in most of the cases, and a marginal area overhead of 0.9%, 2.5% and 8.0%, compared to SoCs with accelerators based on fixed-precision 16-bit multipliers. To sum up, our work provides a comprehensive understanding of ST-based accelerators' performance in an SoC context, paving the way for future enhancements and the solution of identified inefficiencies.

show abstract

“…Enabling rapid simulation of low power and low latency Application-Specific Integrated Circuit (ASIC) designs, consequently, requires a fast and flexible framework that allows evaluation of various CNN hardware accelerators. Frameworks such as FLECSim [15] provide cycle-accurate simulation and evaluation of a broad range of accelerators. However, analyzing each cycle leads to enormous simulation time and is therefore not applicable for our proposed simulation toolchain.…”

Section: A Sensor Nodementioning

confidence: 99%

Hardware-aware Partitioning of Convolutional Neural Network Inference for Embedded AI Applications

Kres

Hoefer

Hotfilter

et al. 2022

2022 18th International Conference on Distributed Computing in Sensor Systems (DCOSS)

Self Cite

View full text Add to dashboard Cite

Embedded image processing applications like multicamera-based object detection or semantic segmentation are often based on Convolutional Neural Networks (CNNs) to provide precise and reliable results. The deployment of CNNs in embedded systems, however, imposes additional constraints such as latency restrictions and limited energy consumption in the sensor platform. These requirements have to be considered during hardware/software co-design of embedded Artifical Intelligence (AI) applications. In addition, the transmission of uncompressed image data from the sensor to a central edge node requires large bandwidth on the link, which must also be taken into account during the design phase.Therefore, we present a simulation toolchain for fast evaluation of hardware-aware CNN partitioning for embedded AI applications. This approach explores an efficient workload distribution between sensor nodes and a central edge node. Neither processing all layers close to the sensor nor transmitting all uncompressed raw data to the edge node is an optimal solution for each use case. Hence, our proposed simulation toolchain evaluates power and performance metrics for each reasonable partitioning point in a CNN. In contrast to the state of the art, our approach does not only consider the neural network architecture. In the evaluation, our simulation toolchain additionally takes into account hardware components such as special accelerators and memories that are implemented in the sensor node.Exemplary, we show the simulation results for three commonly used CNNs in embedded systems. Thereby, we identify advantageous partitioning points regarding inference latency and energy consumption. With the support of the toolchain, we are able to identify three beneficial partitioning points for FCN ResNet-50 and two for GoogLeNet as well as for SqueezeNet V1.1.

show abstract

FLECSim-SoC: A Flexible End-to-End Co-Design Simulation Framework for System on Chips

Cited by 6 publications

References 19 publications

Chronos-V: A Many-Core High-level Model with Support for Management Techniques

Chronos-V: A Many-Core High-level Model with Support for Management Techniques

High-Level Design of Precision-Scalable DNN Accelerators Based on Sum-Together Multipliers

Hardware-aware Partitioning of Convolutional Neural Network Inference for Embedded AI Applications

Contact Info

Product

Resources

About