Data-parallel frameworks have become essential for training machine learning models. The classic Bulk Synchronous Parallel (BSP) model updates the model parameters at pre-defined synchronization barriers. However, when one worker computes significantly more slowly than the others, waiting for this straggler wastes a great deal of computing resources. In this paper, we propose a novel proactive data-parallel (PDP) framework. PDP enables the parameter server to initiate updates of the model parameters; that is, an update can be performed at any time, without pre-defined update points. PDP not only initiates updates but also determines when to perform them, and this global decision on the update frequency accelerates training. We further propose asynchronous PDP to reduce the idle time caused by synchronizing parameter updates, and we theoretically prove its convergence. We implement a distributed PDP framework and evaluate it on several popular machine learning algorithms, including Multilayer Perceptron, Convolutional Neural Network, K-means, and Gaussian Mixture Model. Our evaluation shows that PDP achieves up to a 20X speedup over the BSP model and scales to large clusters.
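To make the server-initiated update concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: it assumes a simple thread-based setup, and all names (ParameterServer, push_gradient, maybe_update) are hypothetical. Workers push gradients as they finish their mini-batches, while the server applies an update whenever its own policy decides to, instead of waiting for every worker at a fixed barrier.

```python
# Hypothetical sketch of a server-initiated ("proactive") update loop.
# Names and the update policy are illustrative, not from the paper.
import random
import threading
import time

import numpy as np


class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr
        self.pending_grads = []          # gradients pushed since the last update
        self.lock = threading.Lock()

    def push_gradient(self, grad):
        # Workers push gradients whenever they finish a mini-batch.
        with self.lock:
            self.pending_grads.append(grad)

    def maybe_update(self, min_grads=2):
        # The server, not a barrier, decides when to apply an update;
        # here, a toy policy: update once enough gradients have arrived.
        with self.lock:
            if len(self.pending_grads) >= min_grads:
                avg = np.mean(self.pending_grads, axis=0)
                self.params -= self.lr * avg
                self.pending_grads.clear()
                return True
        return False


def worker(server, steps, dim):
    for _ in range(steps):
        time.sleep(random.uniform(0.01, 0.05))   # simulate uneven compute speed
        grad = np.random.randn(dim)               # stand-in for a real gradient
        server.push_gradient(grad)


if __name__ == "__main__":
    dim = 4
    server = ParameterServer(dim)
    threads = [threading.Thread(target=worker, args=(server, 20, dim))
               for _ in range(4)]
    for t in threads:
        t.start()
    # Server-driven loop: apply updates whenever the policy fires,
    # without waiting for every worker.
    while any(t.is_alive() for t in threads):
        server.maybe_update(min_grads=2)
        time.sleep(0.005)
    for t in threads:
        t.join()
    server.maybe_update(min_grads=1)   # flush any remaining gradients
    print("final params:", server.params)
```

Under BSP, by contrast, the update point is fixed in advance and every worker must reach the barrier before the parameters change; the sketch above illustrates how moving that decision to the server removes the dependence on the slowest worker.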