AnaCoNGA: Analytical HW-CNN Co-Design Using Nested Genetic Algorithms

Fasfous, Nael; Vemparala, Manoj Rohit; Frickenstein, Alexander; Valpreda, Emanuele; Salihu, Driton; Hufer, Julian; Singh, Anupinder; Nagaraja, Naveen-Shankar; Voegel, Hans-Joerg; Doan, Nguyen Anh Vu; Martina, Maurizio; Becker, Juergen; Stechele, Walter

doi:10.23919/date54114.2022.9774574

Cited by 12 publications

(4 citation statements)

References 8 publications

“…NN accelerators are typically based on a hardware architecture that comprises a memory hierarchy and an array of interconnected processing engines (PEs) [4], [7], [8], [9], [10], similar to the one depicted in Figure 1. The memory hierarchy usually comprises the system's main memory (off-chip), global buffers (on-chip), and the registers within the processing engines.…”

Section: A Hardware Accelerators and Mapping (mentioning)

confidence: 99%

“…Figure 10 depicts the utilization of approximation levels for each NN layer. It is possible to notice the presence of peaks around levels with an index equal to or smaller than 2^j − 1, j ∈ [0, 1, 2, 3, 4, 5, 6, 7, 8], corresponding to the Pareto optimal points of Figure 8. In Figure 10, it is shown how the majority of approximation levels are used in the Pareto front, justifying the choice to maintain power and MRED-dominated configurations as the most efficient levels might not be optimal ones to achieve a high task accuracy.…”

Section: Benchmark With Cifar-10 (mentioning)

confidence: 99%
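
The index expression in the statement above can be checked with a few lines of Python. This is only a sketch: it assumes the expression reads 2^j − 1 with j as an exponent running from 0 to 8, and the function name `peak_indices` is made up for illustration.

```python
# Assumption: the quoted expression is 2**j - 1 for j = 0..8.
def peak_indices(max_exponent: int = 8) -> list[int]:
    """Return the level indices 2**j - 1 for j in 0..max_exponent."""
    return [2**j - 1 for j in range(max_exponent + 1)]


print(peak_indices())  # [0, 1, 3, 7, 15, 31, 63, 127, 255]
```

Under that reading the largest value, 255, would be the top index of a 256-level multiplier if the levels are numbered from 0, which is itself an assumption.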

“…challenges in deploying them on energy-constrained devices, such as microcontrollers (MCUs) [1], [2]. Over the past decade, dedicated hardware accelerators [3], [4], [5], [6], [7], [8], [9] have been developed to optimize energy efficiency and throughput during inference by reducing the data movement and the cost of arithmetic operations. The latter usually accounts for around a fourth of the total inference energy, as the convolutional and fully connected layers of NNs involve millions of multiplications and additions [10].…”

Section: Introduction (mentioning)

confidence: 99%

“…It lowers the amount of memory traffic and the computational cost, allowing the deployment of NNs on low-power devices with no support for floating-point arithmetic. Layer-wise quantization leverages the different degrees of robustness and tolerance to error introduction of NN layers. Therefore, a runtime reconfigurable multiplier supporting operands with different bitwidths is necessary to leverage mixed-precision and layer-wise quantization [3], [4], [11], [12]. Similarly to quantization, approximation is another co-design technique mainly aimed at reducing the inference energy [14].…”

Section: Introduction (mentioning)

confidence: 99%

MARLIN: A Co-Design Methodology for Approximate ReconfigurabLe Inference of Neural Networks at the Edge

Guella, Valpreda, Caon et al. 2024

IEEE Trans. Circuits Syst. I, doi:10.1109/TCSI.2024.3365952

Self Cite

The optimization of neural networks (NNs) is necessary to enable their deployment on energy-constrained devices. State-of-the-art methods leverage approximate multipliers to execute NNs, reducing the inference energy without heavily affecting the accuracy. However, previous works usually require a specialized hardware accelerator and are limited to fixed multipliers or reconfigurable ones with few approximation levels. This paper introduces MARLIN, a framework to deploy layer-wise approximate NNs on PULP, a microcontroller with a RISC-V core. A multiplier architecture, with runtime selection of 256 approximation levels, is developed and integrated into the PULP cluster cores, enabling runtime configuration through control status register (CSR) instructions embedded within the code. The PULP toolchain is adapted to incorporate the approximation level selection within the instruction flow seamlessly. MARLIN leverages the genetic algorithm NSGA-II to search for the best configurations among thousands of approximate NNs. The framework is validated by simulating an approximate NN trained with the MNIST dataset on PULP. Moreover, MARLIN is used to optimize and approximate six ResNet models trained with the CIFAR-10 dataset. In particular, for ResNet-56, the most complex NN used in the experiments, the multiplication energy is reduced by 23.9% while retaining 99% of the accuracy of the exact model.
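
The abstract names NSGA-II as the search algorithm. The sketch below illustrates only the Pareto-dominance test that NSGA-II's non-dominated sorting builds on. It is not the full algorithm and not MARLIN's code. The `Config` type, the two objectives (relative multiplication energy and accuracy loss, both minimized), and every number are hypothetical.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    energy: float  # hypothetical relative multiplication energy (lower is better)
    error: float   # hypothetical accuracy loss (lower is better)


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse in both objectives and strictly better in one."""
    no_worse = a.energy <= b.energy and a.error <= b.error
    strictly_better = a.energy < b.energy or a.error < b.error
    return no_worse and strictly_better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up candidates, for illustration only.
candidates = [
    Config("A", 1.00, 0.00),
    Config("B", 0.80, 0.01),
    Config("C", 0.85, 0.02),  # dominated by B
    Config("D", 0.60, 0.05),
    Config("E", 0.70, 0.05),  # dominated by D
]

print([c.name for c in pareto_front(candidates)])  # ['A', 'B', 'D']
```

A full NSGA-II run would repeat this ranking over successive fronts and add crowding-distance selection, crossover, and mutation.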
