2019
DOI: 10.14778/3342263.3342276
Crossbow

Abstract: Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of training data at a time, partitioned across GPUs, and average the resulting partial gradients to obtain an updated global model. To fully utilise all GPUs, systems must increase the batch size, which hinders statistical efficiency. Users tune hyper-parameters such as the learning rate […]
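The abstract describes data-parallel synchronous SGD: the global batch is partitioned across GPUs, each GPU computes a partial gradient on its shard, and the averaged gradient updates a single global model. Below is a minimal NumPy sketch of that pattern; the toy loss and the names (loss_grad, sync_sgd_step, num_workers) are illustrative assumptions, not code from TensorFlow, Caffe2, or Crossbow.

```python
# Minimal sketch of parallel synchronous SGD: partition a batch across
# workers, compute partial gradients, average them, update one global model.
import numpy as np

def loss_grad(w, x, y):
    # Gradient of a toy least-squares loss 0.5 * ||x @ w - y||^2.
    return x.T @ (x @ w - y) / len(y)

def sync_sgd_step(w, batch_x, batch_y, num_workers, lr):
    # Partition the global batch across workers (GPUs in a real system).
    xs = np.array_split(batch_x, num_workers)
    ys = np.array_split(batch_y, num_workers)
    # Each worker computes a partial gradient on its shard (sequential here).
    partial_grads = [loss_grad(w, xi, yi) for xi, yi in zip(xs, ys)]
    # Average the partial gradients and apply one update to the global model.
    grad = np.mean(partial_grads, axis=0)
    return w - lr * grad

# Usage: one synchronous step on random data with 4 workers.
rng = np.random.default_rng(0)
w = np.zeros(4)
x, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = sync_sgd_step(w, x, y, num_workers=4, lr=0.1)
```

Note that all workers must finish before the average can be taken, which is the synchronisation point the paper seeks to avoid without inflating the batch size.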

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
2
1

Citation Types

1
13
0

Year Published

2021
2021
2024
2024

Publication Types

Select...
4
1
1

Relationship

0
6

Authors

Journals

Cited by 45 publications (17 citation statements). References 32 publications.
“…Once this is done, the next epoch can begin. The synchronization imposed by gradient aggregation at every epoch is the main limitation of synchronous SGD, known as the straggler problem [27]. Asynchronous SGD [37,42,39,32,2] transforms gradient aggregation into a completely asynchronous process in which a GPU transitions to the next epoch immediately after its partial gradient is added to the aggregated gradient.…”
Section: Multi-GPU SGD Training
confidence: 99%
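The statement above contrasts this with asynchronous aggregation, where a GPU adds its partial gradient to the shared state and proceeds immediately, without a barrier. A thread-based sketch of that pattern, assuming a hypothetical AsyncAggregator class and grad_fn helper (not the cited systems' APIs):

```python
# Sketch of asynchronous gradient aggregation: each worker applies its partial
# gradient to the shared model and continues at once, never waiting for peers.
import threading
import numpy as np

class AsyncAggregator:
    def __init__(self, dim, lr):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def apply(self, grad):
        # Add this worker's contribution; no barrier with other workers.
        with self.lock:
            self.w -= self.lr * grad

def worker(agg, grad_fn, steps):
    for _ in range(steps):
        g = grad_fn(agg.w)   # gradient may be computed on a stale model
        agg.apply(g)         # apply and move on immediately

# Usage: two workers minimising a toy quadratic 0.5 * ||w - 1||^2.
agg = AsyncAggregator(dim=4, lr=0.1)
grad_fn = lambda w: w - np.ones(4)
threads = [threading.Thread(target=worker, args=(agg, grad_fn, 50)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The stale reads in grad_fn are the price of removing the barrier: workers may compute gradients against a model that other workers have already updated.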
“…However, beyond a certain point, a large learning rate impacts model convergence negatively [19]. Thus, model averaging is a more reliable algorithm [27].…”
Section: Multi-GPU SGD Training
confidence: 99%
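This statement prefers model averaging over simply scaling up the learning rate: each worker keeps its own model replica, takes local steps on its shard, and the replicas are periodically averaged. A minimal sketch under those assumptions, with illustrative names (local_step, averaged_round) rather than the paper's actual implementation:

```python
# Sketch of model averaging: independent replicas train locally, then their
# parameters are averaged and broadcast back to every replica.
import numpy as np

def local_step(w, x, y, lr):
    # One SGD step on a toy least-squares loss.
    grad = x.T @ (x @ w - y) / len(y)
    return w - lr * grad

def averaged_round(replicas, shards, lr, local_steps):
    # Each replica trains independently on its own shard for a few steps ...
    for i, (x, y) in enumerate(shards):
        for _ in range(local_steps):
            replicas[i] = local_step(replicas[i], x, y, lr)
    # ... then all replicas are averaged and the average replaces each replica.
    avg = np.mean(replicas, axis=0)
    return [avg.copy() for _ in replicas]

# Usage: two replicas, two data shards, one averaging round.
rng = np.random.default_rng(1)
shards = [(rng.normal(size=(16, 4)), rng.normal(size=16)) for _ in range(2)]
replicas = [np.zeros(4) for _ in range(2)]
replicas = averaged_round(replicas, shards, lr=0.1, local_steps=5)
```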