2019
DOI: 10.48550/arxiv.1904.10631
Preprint

Low-Memory Neural Network Training: A Technical Report

Abstract: Memory is increasingly often the bottleneck when training neural network models. Despite this, techniques to lower the overall memory requirements of training have been less widely studied compared to the extensive literature on reducing the memory requirements of inference. In this paper we study a fundamental question: How much memory is actually needed to train a neural network? To answer this question, we profile the overall memory usage of training on two representative deep learning benchmarks - the WideR…
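
As a concrete illustration of what such a profile covers, the sketch below, which assumes PyTorch and an arbitrary toy model rather than the benchmarks used in the report, tallies the usual components of training memory: weights, gradients, optimizer state, and the tensors autograd saves for the backward pass. The model, batch size, and layer widths are placeholders chosen for illustration.

    # A minimal sketch, assuming PyTorch and an arbitrary toy model (not the
    # benchmarks from the report): tally the main components of training memory.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Flatten(), nn.Linear(64 * 32 * 32, 10),
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    x = torch.randn(128, 3, 32, 32)            # one illustrative training batch
    y = torch.randint(0, 10, (128,))

    # Count every tensor autograd saves for the backward pass. This is an upper
    # bound on activation memory, since some saved tensors are parameters.
    saved_bytes = 0
    def pack(t):
        global saved_bytes
        saved_bytes += t.numel() * t.element_size()
        return t

    with torch.autograd.graph.saved_tensors_hooks(pack, lambda t: t):
        loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()                                  # materializes momentum buffers

    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    grad_bytes = sum(p.grad.numel() * p.grad.element_size()
                     for p in model.parameters() if p.grad is not None)
    opt_bytes = sum(t.numel() * t.element_size()
                    for state in opt.state.values() for t in state.values()
                    if torch.is_tensor(t))

    for name, b in [("weights", param_bytes), ("gradients", grad_bytes),
                    ("optimizer state", opt_bytes), ("saved tensors", saved_bytes)]:
        print(f"{name:>16}: {b / 2**20:6.1f} MiB")

For a convolutional model at this batch size, the saved-tensor line typically dominates, which is the pattern the report and several of the citing papers below emphasize.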

Cited by 13 publications (16 citation statements)
References 37 publications
“…However, since convolutional layers are computationally more expensive than fully-connected layers (i.e., our target to improve in this work), as analyzed in [17], and the real bottleneck of on-device training is memory bound as shown in Section 4, analyzing and improving computationally expensive convolutional layers is a potential future direction. Our finding of the memory bottleneck issue also suggests investigating how to enable on-device training with memory optimization in terms of the model, the optimizer, and the activations [18].…”
Section: Profiling Compute and Memory Operations For Training
confidence: 76%
“…In order to quantify the potential gains from approximation, we conducted a variable representation and lifetime analysis of Algorithm 1 following the approach taken by Sohoni et al (2019). Table 2 lists the properties of all variables in Algorithm 1, with each variable's contributions to the total footprint shown for a representative example.…”
Section: Variable Analysis
confidence: 99%
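
To make the kind of tally described in that excerpt concrete, here is a hypothetical sketch of a variable footprint-and-lifetime analysis; the variable names, sizes, and live spans are invented for illustration and are not drawn from the cited paper.

    # Hypothetical variable footprint and lifetime tally; all entries invented.
    from dataclasses import dataclass

    DTYPE_BYTES = {"float32": 4, "float16": 2, "int8": 1}

    @dataclass
    class Var:
        name: str
        numel: int      # number of elements
        dtype: str
        live: tuple     # (first step alive, last step alive) within one iteration

        @property
        def bytes(self) -> int:
            return self.numel * DTYPE_BYTES[self.dtype]

    variables = [
        Var("weights",         1_000_000, "float32", (0, 9)),
        Var("momentum buffer", 1_000_000, "float32", (0, 9)),
        Var("activations",     8_000_000, "float32", (1, 7)),
        Var("gradients",       1_000_000, "float32", (5, 9)),
    ]

    total = sum(v.bytes for v in variables)
    for v in variables:
        print(f"{v.name:>16}: {v.bytes / 2**20:6.1f} MiB "
              f"({100 * v.bytes / total:4.1f}% of total)")

    # Peak footprint follows from lifetimes: variables whose live spans do not
    # overlap could in principle share the same memory.
    peak = max(
        sum(v.bytes for v in variables if v.live[0] <= t <= v.live[1])
        for t in range(10)
    )
    print(f"peak concurrent footprint: {peak / 2**20:.1f} MiB "
          f"of {total / 2**20:.1f} MiB total")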
“…Despite featuring binary forward propagation, existing BNN training approaches perform backward propagation using high-precision floating-point data types (typically float32), often making training infeasible on resource-constrained devices. The high-precision activations used between forward and backward propagation commonly constitute the largest proportion of the total memory footprint of a training run (Sohoni et al., 2019; Cai et al., 2020). Moreover, backward propagation with high-precision gradients is costly, challenging the energy limitations of edge platforms.…”
Section: Introduction
confidence: 99%
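
To illustrate the scale of the imbalance that excerpt describes, here is a back-of-the-envelope sketch; the layer widths, batch size, and feature-map size are assumptions chosen for illustration, not figures from the cited papers.

    # Hypothetical estimate: float32 activations saved between forward and
    # backward vs. the same tensors at 1 bit per value, vs. float32 weights.
    batch, height, width = 128, 32, 32      # illustrative batch and feature-map size
    channels = [64, 64, 128, 128, 256]      # illustrative layer widths

    # Activation elements saved per layer (spatial size kept fixed for simplicity).
    activation_elems = sum(batch * c * height * width for c in channels)

    # 3x3 convolution weights, chaining an RGB input through the widths above.
    weight_elems = sum(3 * 3 * c_in * c_out
                       for c_in, c_out in zip([3] + channels[:-1], channels))

    print(f"float32 activations: {activation_elems * 4 / 2**20:8.1f} MiB")
    print(f"1-bit activations:   {activation_elems / 8 / 2**20:8.1f} MiB")
    print(f"float32 weights:     {weight_elems * 4 / 2**20:8.1f} MiB")

Even in this crude accounting, the float32 activations outweigh the weights by roughly two orders of magnitude, which is why the activation footprint is the natural target for low-memory training.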
“…By constraining the trainable parameters, such as the weights, to be updated only by local variables (the information contained in the neurons that share the same parameter), we can reduce the memory required to load a model on hardware such as CPUs and GPUs. This constraint can save memory resources and has many potential applications, from low-memory devices [7,8] to training with large batch sizes [9,10] and, even further, to training very large neural networks [11].…”
Section: Introduction
confidence: 99%