2019
DOI: 10.14778/3342263.3342276
Crossbow

Abstract: Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of training data at a time, partitioned across GPUs, and average the resulting partial gradients to obtain an updated global model. To fully utilise all GPUs, systems must increase the batch size, which hinders statistical efficiency. Users tune hyper-parameters such as the learning rate […]
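The abstract describes data-parallel synchronous SGD: the global batch is partitioned across GPUs, each GPU computes a partial gradient on its shard, and the averaged gradient updates a single global model. Below is a minimal NumPy sketch of that pattern; the toy loss and the names (loss_grad, sync_sgd_step, num_workers) are illustrative assumptions, not code from TensorFlow, Caffe2, or Crossbow.

```python
# Minimal sketch of parallel synchronous SGD: partition a batch across
# workers, compute partial gradients, average them, update one global model.
import numpy as np

def loss_grad(w, x, y):
    # Gradient of a toy least-squares loss 0.5 * ||x @ w - y||^2.
    return x.T @ (x @ w - y) / len(y)

def sync_sgd_step(w, batch_x, batch_y, num_workers, lr):
    # Partition the global batch across workers (GPUs in a real system).
    xs = np.array_split(batch_x, num_workers)
    ys = np.array_split(batch_y, num_workers)
    # Each worker computes a partial gradient on its shard (sequential here).
    partial_grads = [loss_grad(w, xi, yi) for xi, yi in zip(xs, ys)]
    # Average the partial gradients and apply one update to the global model.
    grad = np.mean(partial_grads, axis=0)
    return w - lr * grad

# Usage: one synchronous step on random data with 4 workers.
rng = np.random.default_rng(0)
w = np.zeros(4)
x, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = sync_sgd_step(w, x, y, num_workers=4, lr=0.1)
```

Note that all workers must finish before the average can be taken, which is the synchronisation point the paper seeks to avoid without inflating the batch size.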

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
2
1

Citation Types

1
13
0

Year Published

2021
2021
2024
2024

Publication Types

Select...
4
1
1

Relationship

0
6

Authors

Journals

Cited by 45 publications (17 citation statements). References 32 publications.
“…Once this is done, the next epoch can begin. The synchronization imposed by gradient aggregation at every epoch is the main limitation of synchronous SGD, known as the straggler problem [27]. Asynchronous SGD [37,42,39,32,2] transforms gradient aggregation into a completely asynchronous process in which a GPU transitions to the next epoch immediately after its partial gradient is added to the aggregated gradient.…”
Section: Multi-GPU SGD Training
confidence: 99%
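The statement above contrasts this with asynchronous aggregation, where a GPU adds its partial gradient to the shared state and proceeds immediately, without a barrier. A thread-based sketch of that pattern, assuming a hypothetical AsyncAggregator class and grad_fn helper (not the cited systems' APIs):

```python
# Sketch of asynchronous gradient aggregation: each worker applies its partial
# gradient to the shared model and continues at once, never waiting for peers.
import threading
import numpy as np

class AsyncAggregator:
    def __init__(self, dim, lr):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def apply(self, grad):
        # Add this worker's contribution; no barrier with other workers.
        with self.lock:
            self.w -= self.lr * grad

def worker(agg, grad_fn, steps):
    for _ in range(steps):
        g = grad_fn(agg.w)   # gradient may be computed on a stale model
        agg.apply(g)         # apply and move on immediately

# Usage: two workers minimising a toy quadratic 0.5 * ||w - 1||^2.
agg = AsyncAggregator(dim=4, lr=0.1)
grad_fn = lambda w: w - np.ones(4)
threads = [threading.Thread(target=worker, args=(agg, grad_fn, 50)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The stale reads in grad_fn are the price of removing the barrier: workers may compute gradients against a model that other workers have already updated.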
“…However, beyond a certain point, a large learning rate impacts model convergence negatively [19]. Thus, model averaging is a more reliable algorithm [27].…”
Section: Multi-GPU SGD Training
confidence: 99%
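This statement prefers model averaging over simply scaling up the learning rate: each worker keeps its own model replica, takes local steps on its shard, and the replicas are periodically averaged. A minimal sketch under those assumptions, with illustrative names (local_step, averaged_round) rather than the paper's actual implementation:

```python
# Sketch of model averaging: independent replicas train locally, then their
# parameters are averaged and broadcast back to every replica.
import numpy as np

def local_step(w, x, y, lr):
    # One SGD step on a toy least-squares loss.
    grad = x.T @ (x @ w - y) / len(y)
    return w - lr * grad

def averaged_round(replicas, shards, lr, local_steps):
    # Each replica trains independently on its own shard for a few steps ...
    for i, (x, y) in enumerate(shards):
        for _ in range(local_steps):
            replicas[i] = local_step(replicas[i], x, y, lr)
    # ... then all replicas are averaged and the average replaces each replica.
    avg = np.mean(replicas, axis=0)
    return [avg.copy() for _ in replicas]

# Usage: two replicas, two data shards, one averaging round.
rng = np.random.default_rng(1)
shards = [(rng.normal(size=(16, 4)), rng.normal(size=16)) for _ in range(2)]
replicas = [np.zeros(4) for _ in range(2)]
replicas = averaged_round(replicas, shards, lr=0.1, local_steps=5)
```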