A collaborative CPU‐GPU approach for deep learning on mobile devices

Valery, Olivier; Liu, Pangfeng; Wu, Jan‐Jan

doi:10.1002/cpe.5225

Cited by 7 publications

(10 citation statements)

References 17 publications

“…From a fundamental point of view, methods making efficient use of the shared memory of mobile devices have been proposed [22, 23]. These studies used shared memory to eliminate the data copy time between the CPU and GPU [22] or prevent data duplication for the GPU [23]. However, it can only prevent the duplication of memory or eliminate the data copy time.…”

Section: Related Work (mentioning)

confidence: 99%

“…The implementation of deep learning training is more complex than that of deep learning inference owing to a lack of resources and the complexity of the process. To solve this issue, a study on deep learning training on mobile devices (DeepMobile) has been conducted [23, 24, 27]. DeepMobile [23, 27] utilized shared memory to solve the memory shortage during training and to optimize mobile GPUs to accelerate training on mobile devices.…”

Section: Related Work (mentioning)

confidence: 99%

“…To solve this issue, a study on deep learning training on mobile devices (DeepMobile) has been conducted [23, 24, 27]. DeepMobile [23, 27] utilized shared memory to solve the memory shortage during training and to optimize mobile GPUs to accelerate training on mobile devices. Another study profiled the latency, data copy time, and search processor pathing using dynamic programming without shared memory [24].…”

Section: Related Work (mentioning)

confidence: 99%

“…Therefore, existing studies applying deep learning algorithms to mobile devices mainly focus on accelerating the deep learning inference, which requires relatively low computing power. To accelerate the deep learning inference, existing approaches optimize inference for mobile processors [18, 19, 20] or perform inference by dividing the model across multiple computing resources of the mobile device [18, 21, 22, 23, 24, 25]. Other approaches focus on enhancing the usability of memory to eliminate data copy time [22, 23].…”

Section: Introduction (mentioning)

confidence: 99%

“…To accelerate the deep learning inference, existing approaches optimize inference for mobile processors [18, 19, 20] or perform inference by dividing the model across multiple computing resources of the mobile device [18, 21, 22, 23, 24, 25]. Other approaches focus on enhancing the usability of memory to eliminate data copy time [22, 23]. However, the existing approaches for accelerating deep learning inference cannot directly be applied for accelerating deep learning training because training the network model involves more complex components.…”

Section: Introduction (mentioning)

confidence: 99%

Accelerating On-Device Learning with Layer-Wise Processor Selection Method on Unified Memory

Kim, Moon, et al. (2021). Sensors.

Recent studies have applied the superior performance of deep learning to mobile devices, and these studies have enabled the running of the deep learning model on a mobile device with limited computing power. However, there is performance degradation of the deep learning model when it is deployed in mobile devices, due to the different sensors of each device. To solve this issue, it is necessary to train a network model specific to each mobile device. Therefore, herein, we propose an acceleration method for on-device learning to mitigate the device heterogeneity. The proposed method efficiently utilizes unified memory for reducing the latency of data transfer during network model training. In addition, we propose the layer-wise processor selection method to consider the latency generated by the difference in the processor performing the forward propagation step and the backpropagation step in the same layer. The experiments were performed on an ODROID-XU4 with the ResNet-18 model, and the experimental results indicate that the proposed method reduces the latency by at most 28.4% compared to the central processing unit (CPU) and at most 21.8% compared to the graphics processing unit (GPU). Through experiments using various batch sizes to measure the average power consumption, we confirmed that device heterogeneity is alleviated by performing on-device learning using the proposed method.
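
The layer-wise selection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each layer has a profiled forward and backward latency per processor, and that running the two steps of the same layer on different processors costs a fixed extra latency. The function names, the `switch_penalty` parameter, and all latency values are hypothetical.

```python
from itertools import product

PROCESSORS = ("cpu", "gpu")


def layer_cost(layer, fwd_proc, bwd_proc, switch_penalty):
    """Estimated latency of one layer for a given (forward, backward) processor pair."""
    cost = layer["fwd"][fwd_proc] + layer["bwd"][bwd_proc]
    if fwd_proc != bwd_proc:
        # Assumed extra latency when the two steps of a layer run on different processors.
        cost += switch_penalty
    return cost


def select_processors(layers, switch_penalty):
    """Choose, layer by layer, the processor pair with the lowest estimated latency."""
    plan = []
    total = 0.0
    for layer in layers:
        best_pair = None
        best_cost = None
        for fwd_proc, bwd_proc in product(PROCESSORS, repeat=2):
            cost = layer_cost(layer, fwd_proc, bwd_proc, switch_penalty)
            if best_cost is None or cost < best_cost:
                best_pair, best_cost = (fwd_proc, bwd_proc), cost
        plan.append(best_pair)
        total += best_cost
    return plan, total


# Hypothetical profiled latencies in milliseconds (not taken from the paper).
example_layers = [
    {"fwd": {"cpu": 5, "gpu": 3}, "bwd": {"cpu": 4, "gpu": 7}},
]

if __name__ == "__main__":
    print(select_processors(example_layers, switch_penalty=1))
    print(select_processors(example_layers, switch_penalty=3))
```

With a small penalty the sketch splits the layer across processors; with a larger one it keeps both steps on the same processor, which is the trade-off the abstract describes. Because the assumed penalty only couples the two steps within a layer, each layer can be decided independently.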

Section: Related Work (mentioning)

confidence: 99%