2020
DOI: 10.1007/978-3-030-57675-2_10

Optimal GPU-CPU Offloading Strategies for Deep Neural Network Training

Abstract: Training Deep Neural Networks is known to be an expensive operation, both in terms of computational cost and memory load. Indeed, during training, all intermediate layer outputs (called activations) computed during the forward phase must be stored until the corresponding gradient has been computed in the backward phase. These memory requirements sometimes prevent the use of larger batch sizes and deeper networks, thus limiting both convergence speed and accuracy. Recent works have proposed to offload…
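
The offloading idea sketched in the abstract can be illustrated with PyTorch's generic saved-tensor hooks: each activation is copied to host memory as soon as autograd saves it during the forward phase, and copied back to the GPU when the backward phase needs it. This is only a minimal sketch of the mechanism, assuming a CUDA device and a toy model; it is not the paper's own offloading strategy, which additionally decides what to transfer and when.

```python
import torch

# Minimal illustrative sketch (not the paper's algorithm): offload every saved
# activation to CPU memory during forward, reload it on demand during backward.

def pack_to_cpu(tensor):
    # Called when autograd saves an activation: move it to host memory.
    return tensor.to("cpu", non_blocking=True)

def unpack_to_gpu(cpu_tensor):
    # Called when backward needs the activation: move it back to the GPU.
    return cpu_tensor.to("cuda", non_blocking=True)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
).cuda()
x = torch.randn(64, 1024, device="cuda", requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
    loss = model(x).sum()   # forward: activations offloaded as they are saved
loss.backward()             # backward: activations reloaded on demand
```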

Cited by 12 publications (10 citation statements) | References 19 publications
“…It demonstrates a promising direction towards more efficient neural scaling laws based on data importance sampling. Herrmann et al [35]: Rematerialization; ZeRO-Offload [74]: Offloading; Beaumont et al [7]: Offloading + Rematerialization; ZeRO [72]: DP+MP+AMP; Megatron-LM [75]: DP+TP; GPipe [40]: DP+PP; torchgpipe [48]: PP+Rematerialization; Megatron-LM* [65]: DP+TP+PP+AMP; Wang et al [84]: FP8 Training; Cambier et al [11]: FP8 Training; Mesa [68]: 8-bit ACT; ACTNN [12], GACT [60]: 2-bit ACT; [52,42,37]: Addition-based PET; Bitfit [89], LoRA [38]: Reparameterization-based PET…”
Section: Data Selection
confidence: 99%
“…The work presented in [54] combines rematerialization to trade memory for computation time and offloading to trade memory for data movement. It employs a dynamic programming heuristic to determine the optimal offloading sequence.…”
Section: Further Analysis, 1) Training Efficiency
confidence: 99%
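
To make the idea of a programmed offloading decision concrete, here is a minimal, purely illustrative dynamic program. It is not the heuristic of the cited work [54]: it simply treats the choice of which activations to keep on the GPU as a 0/1 knapsack, keeping the activations whose offload-and-reload would be most expensive within a memory budget and offloading the rest. The names plan_offloading, sizes, transfer_costs, and budget are all hypothetical.

```python
def plan_offloading(sizes, transfer_costs, budget):
    """Toy knapsack-style DP for deciding which activations to offload.

    sizes[i]          -- integer memory footprint of activation i (arbitrary units)
    transfer_costs[i] -- cost of offloading and later reloading activation i
    budget            -- GPU memory available for the activations we keep

    Keeps on the GPU the activations whose reload would be most expensive,
    subject to the budget, and returns the indices of those to offload.
    """
    n = len(sizes)
    value = [0.0] * (budget + 1)                      # best kept-cost per memory usage
    taken = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for m in range(budget, sizes[i] - 1, -1):     # classic 0/1-knapsack update
            if value[m - sizes[i]] + transfer_costs[i] > value[m]:
                value[m] = value[m - sizes[i]] + transfer_costs[i]
                taken[i][m] = True
    # Backtrack to recover the kept set; everything else is offloaded.
    kept, m = set(), budget
    for i in range(n - 1, -1, -1):
        if taken[i][m]:
            kept.add(i)
            m -= sizes[i]
    return [i for i in range(n) if i not in kept]

# Example: four activations, a GPU budget of 6 memory units.
print(plan_offloading(sizes=[4, 3, 2, 2], transfer_costs=[5.0, 4.0, 1.0, 1.5], budget=6))
# -> [1, 2]: activations 1 and 2 are offloaded; 0 and 3 stay on the GPU.
```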
“…• Offloading [6]: Offloading network activations from accelerator to system memory. Whenever the back-propagation process requires a set of activations, they are transferred back from system to accelerator memory.…”
Section: Memory Workarounds
confidence: 99%
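
As a concrete illustration of this workaround (a generic mechanism, not the cited paper's scheduler), PyTorch ships a built-in hook, torch.autograd.graph.save_on_cpu, that keeps saved activations in (optionally pinned) host memory and copies them back to the accelerator only when back-propagation needs them. A minimal sketch, assuming a CUDA device:

```python
import torch

# save_on_cpu moves every tensor that autograd saves for backward into host
# memory (pinned here, to speed up the transfers back to the device).
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                            torch.nn.Linear(512, 10)).cuda()
x = torch.randn(32, 512, device="cuda")

with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()     # forward: activations are moved to system memory
loss.backward()               # backward: activations are moved back on demand
```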
“…At the same time, current accelerators (e.g., GPUs, TPUs) are rather limited in terms of memory capacity, although workarounds to handle memory loads larger than the capacity offered by the device have already been proposed (as discussed in Section 1). These workarounds include model parallelism [3,10], activation re-computation [7] and offloading [6], enabling greater memory loads at the cost of computation efficiency. In this high-memory-load context, avoiding accelerators and using CPU computation must be considered as a feasible alternative.…”
Section: Introduction
confidence: 99%
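
The activation re-computation workaround mentioned in this statement can be sketched with PyTorch's generic checkpointing utility: only segment boundaries are stored during the forward phase, and the activations inside each segment are recomputed during the backward phase. This is a minimal sketch of the generic mechanism, assuming a CUDA device and a toy model, not the specific method of reference [7].

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# Toy model: eight linear layers whose inner activations we choose not to keep.
model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(8)]).cuda()
x = torch.randn(16, 256, device="cuda", requires_grad=True)

# Split the 8 layers into 4 segments: only the input of each segment is kept
# in GPU memory; everything inside a segment is recomputed during backward.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```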