2023
DOI: 10.1007/s11227-023-05050-4

Performance–energy trade-offs of deep learning convolution algorithms on ARM processors

Abstract: In this work, we assess the performance and energy efficiency of high-performance codes for the convolution operator, based on the direct, explicit/implicit lowering and Winograd algorithms used for deep learning (DL) inference on a series of ARM-based processor architectures. Specifically, we evaluate the NVIDIA Denver2 and Carmel processors, as well as the ARM Cortex-A57 and Cortex-A78AE CPUs as part of a recent set of NVIDIA Jetson platforms. The performance–energy evaluation is carried out using the ResNet…

Cited by 3 publications (3 citation statements)
References 24 publications
“…2) Convolution operation: Fixed iteration count loops are a key mechanism for implementing the convolution operation [21] in convolutional neural networks (CNNs) [22], [23]. It involves looping over the elements of the input feature map and the convolution kernel, performing multiplication and accumulation operations to generate the output feature map.…”
Section: Application Of Fixed Iteration Count Loops
confidence: 99%
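The citation statement above describes direct convolution as nested fixed-bound loops performing multiply-accumulate operations. A minimal sketch of that idea, in pure Python (the function and variable names `conv2d`, `x`, `w` are illustrative, not taken from the cited papers):

```python
# Direct 2-D convolution (cross-correlation, as in CNN frameworks) expressed
# as fixed iteration count loops: every loop bound is known before the loops
# start, which is the property the citing paper highlights.

def conv2d(x, w):
    """Valid-mode 2-D convolution of input feature map x with kernel w."""
    H, W = len(x), len(x[0])
    KH, KW = len(w), len(w[0])
    OH, OW = H - KH + 1, W - KW + 1            # output spatial dimensions
    y = [[0.0] * OW for _ in range(OH)]
    for i in range(OH):                        # fixed trip counts on
        for j in range(OW):                    # all four loops
            acc = 0.0
            for ki in range(KH):
                for kj in range(KW):
                    # multiply-accumulate over the kernel window
                    acc += x[i + ki][j + kj] * w[ki][kj]
            y[i][j] = acc
    return y

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
w = [[1, 0],
     [0, 1]]
print(conv2d(x, w))   # [[6.0, 8.0], [12.0, 14.0]]
```

Production kernels (such as those evaluated in the paper) block and vectorize these loops, but the loop nest with known trip counts is the same underlying structure.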
“…However, modern convolutional and capsule neural networks use small filters more often than the traditionally used large filters computed via the FFT approach. Winograd's minimal filtering algorithm [1, 12-15], which has recently gained significant popularity, is widely regarded as well-suited for such scenarios. This approach is particularly efficient when employing small filters and tile sizes.…”
Section: Introduction
confidence: 99%
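The minimal filtering idea referenced above can be illustrated with the smallest 1-D instance, F(2,3): two outputs of a 3-tap filter computed with four multiplications instead of the six a sliding dot product needs. This is a sketch following the standard F(2,3) formulas (the names `winograd_f23`, `d`, `g` are illustrative assumptions, not code from the cited work):

```python
# Winograd's minimal filtering algorithm F(2,3): compute
#   y[i] = sum_k d[i+k] * g[k]  for i in {0, 1}
# with 4 multiplications (m1..m4) instead of 6.

def winograd_f23(d, g):
    """Two filter outputs from a 4-element input tile d and 3-tap filter g."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Reference: naive sliding dot product (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 1.0, -1.0]
print(winograd_f23(d, g), direct_f23(d, g))   # same result from both
```

In 2-D CNN kernels the same transform is applied along both spatial axes (e.g. F(2x2,3x3)), and the filter-side factors involving g can be precomputed once per layer, which is why the approach pays off for the small 3x3 filters the quote mentions.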
“…Similarly, Qasaimeh et al. [43] compared the performance of three hardware accelerators for embedded vision applications. The performance of ARM processors on deep learning workloads was also investigated by Dolz et al. [44], who studied the ARM Cortex-A57 and Cortex-A78AE CPUs among other processors.…”
confidence: 99%