2022
DOI: 10.1145/3527457
Memory-Throughput Trade-off for CNN-Based Applications at the Edge

Abstract: Many modern applications require execution of Convolutional Neural Networks (CNNs) on edge devices, such as mobile phones or embedded platforms. This can be challenging, as state-of-the-art CNNs are memory-costly, whereas the memory budget of edge devices is highly limited. To address this challenge, a variety of CNN memory reduction methodologies have been proposed. Typically, the memory of a CNN is reduced using methodologies such as pruning and quantization. These methodologies reduce the number or preci…
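The two memory reduction methodologies the abstract names, pruning and quantization, can be illustrated with a minimal NumPy sketch. This is not the paper's method: the symmetric int8 scheme, the magnitude-based pruning criterion, and the function names are illustrative assumptions.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric post-training quantization: float32 weights -> int8 + scale.

    Storing int8 instead of float32 cuts weight memory by 4x.
    """
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def prune_by_magnitude(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)

q, scale = quantize_int8(w)
print(w.nbytes, q.nbytes)  # 8192 2048: int8 storage is 4x smaller
# Round-trip error is bounded by half a quantization step:
print(np.max(np.abs(w - q.astype(np.float32) * scale)) <= scale)

w_sparse = prune_by_magnitude(w, sparsity=0.5)
print(np.mean(w_sparse == 0))  # roughly half the weights are now zero
```

Pruned weights only save memory once stored in a sparse format (e.g., CSR); the dense array above has the same footprint, which is one reason pruning and quantization are usually combined with a suitable storage scheme.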

Cited by 6 publications (5 citation statements)
References 21 publications
“…Since we target a mobile application running the classification locally on the devices, model performance and resource requirements are important. This trade-off currently gains much attention in industrial applications, e.g., in [53]. The approaches to pruning or quantizing the models are state of the art as, e.g., described in [14,15].…”
Section: Related Work
confidence: 99%
“…Quantization and pruning, as explained above, are therefore essential for an efficient implementation. At the same time, the high throughput required by certain applications is very hard to achieve with general purpose hardware, such as CPUs and GPUs, because of limitations in both memory and computation bandwidth, which appeal more to the average case [ 51 ]. While several integrated custom accelerators exist, they cannot be easily reconfigured to compute with arbitrary precision, reducing the hardware utilization and therefore performance.…”
Section: CNN and Quantization Background
confidence: 99%
“…1 in different colors/patterns, and their overlap caused by 3x3 convolutions is marked in purple/crossed. FFMT was first employed for reducing peak memory usage in [9], but their path discovery requires partially manual user effort. Other works that use FFMT with automated path discovery are [5,10,19,25,26].…”
[Flattened table residue from the citing paper: [32]: RAM reduction; Full Distributed Inference [30]: RAM and ROM reduction; Partly Manual Tiling [5,9]: RAM reduction; Automated Tiling [6,10,19,23-26]: RAM reduction.]
Section: Related Work
confidence: 99%
“…FFMT was first employed for reducing peak memory usage in [9], but their path discovery requires partially manual user effort. Other works that use FFMT with automated path discovery are [5,10,19,25,26]. FFMT along with tiling in the depthwise dimension for single layers without operator fusion was explored in [6,23].…”
Section: Related Work
confidence: 99%
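The fused feature-map tiling (FFMT) idea these citation statements discuss, computing stacked convolutions tile by tile so that only a halo-extended tile, rather than the full intermediate feature map, is live at once, can be sketched as follows. This is a simplified single-channel illustration with 'valid' 3x3 convolutions; the helper names and tile size are assumptions, not the cited works' implementations. The 2-pixel halo per side is exactly the tile overlap caused by two stacked 3x3 convolutions that the first snippet mentions.

```python
import numpy as np

def conv3x3(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """'Valid' 3x3 convolution on a 2D map: output shrinks by 2 per axis."""
    h, w = x.shape[0] - 2, x.shape[1] - 2
    out = np.zeros((h, w), dtype=x.dtype)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * x[i:i + h, j:j + w]
    return out

def fused_tiled(x, k1, k2, tile=8):
    """Fused feature-map tiling over two stacked 3x3 convs.

    Each output tile needs only a small input slice with a 2-pixel halo
    on each side, so the tile-sized intermediate is the peak buffer,
    not the full intermediate feature map.
    """
    H, W = x.shape[0] - 4, x.shape[1] - 4  # output size after two valid 3x3s
    out = np.zeros((H, W), dtype=x.dtype)
    for r in range(0, H, tile):
        for c in range(0, W, tile):
            th, tw = min(tile, H - r), min(tile, W - c)
            xs = x[r:r + th + 4, c:c + tw + 4]  # tile + halo from both convs
            out[r:r + th, c:c + tw] = conv3x3(conv3x3(xs, k1), k2)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((36, 36)).astype(np.float32)
k1, k2 = rng.standard_normal((2, 3, 3)).astype(np.float32)

full = conv3x3(conv3x3(x, k1), k2)      # keeps a full 34x34 intermediate live
tiled = fused_tiled(x, k1, k2, tile=8)  # peak intermediate is only 10x10
print(np.allclose(full, tiled, atol=1e-4))  # same result, lower peak memory
```

The memory-throughput trade-off in the paper's title shows up here too: the halo regions are recomputed by neighboring tiles, so smaller tiles lower peak memory at the cost of redundant computation.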