Addressing the straggler problem for iterative convergent parallel ML

Harlap, Aaron; Cui, Henggang; Dai, Wei; Wei, Jianhui; Ganger, Gregory R.; Gibbons, Phillip B.; Gibson, Garth A.; Xing, Eric P.

doi:10.1145/2987550.2987554

Cited by 113 publications

(56 citation statements)

References 24 publications


“…When scaling the training of deep learning models in distributed clusters, a parameter server (PS) [35] design is the de-facto approach. In contrast to CROSSBOW, which improves the training performance with small batch sizes on a single multi-GPU server, PS-based systems address the challenges of using a cluster for distributed learning, including the handling of elastic and heterogeneous resources [22,27], the mitigation of stragglers [13,11,16], the acceleration of synchronisation using hybrid hardware [12], and the avoidance of resource fragmentation using collective communication [63,56,24]. Similar to prior model averaging systems [69], CROSSBOW could adopt a PS design to manage its average model in a distributed deployment.…”

Section: Related Work (mentioning)

confidence: 99%

Crossbow

et al. 2019


Deep learning models are trained on servers with many GPUs, and training must scale with the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel synchronous stochastic gradient descent: they process a batch of training data at a time, partitioned across GPUs, and average the resulting partial gradients to obtain an updated global model. To fully utilise all GPUs, systems must increase the batch size, which hinders statistical efficiency. Users tune hyper-parameters such as the learning rate to compensate for this, which is complex and model-specific. We describe CROSSBOW, a new single-server multi-GPU system for training deep learning models that enables users to freely choose their preferred batch size, however small, while scaling to multiple GPUs. CROSSBOW uses many parallel model replicas and avoids reduced statistical efficiency through a new synchronous training method. We introduce SMA, a synchronous variant of model averaging in which replicas independently explore the solution space with gradient descent, but adjust their search synchronously based on the trajectory of a globally-consistent average model. CROSSBOW achieves high hardware efficiency with small batch sizes by potentially training multiple model replicas per GPU, automatically tuning the number of replicas to maximise throughput. Our experiments show that CROSSBOW improves the training time of deep learning models on an 8-GPU server by 1.3-4× compared to TensorFlow.
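The parallel synchronous SGD baseline that this abstract describes (partition a batch across GPUs, average the partial gradients, update one global model) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 1-D least-squares model, the function names `partial_gradient` and `sync_sgd_step`, and the numbers. It is not code from CROSSBOW, TensorFlow, or Caffe2, and it does not implement SMA.

```python
# Sketch only: pure-Python stand-in for synchronous data-parallel SGD.
# The model (1-D least squares), names, and data are illustrative assumptions.

def partial_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)^2 over one shard of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def sync_sgd_step(w, batch, num_workers, lr):
    """Split the batch evenly, average the per-worker gradients, update w.

    Assumes len(batch) is divisible by num_workers, so the average of the
    shard gradients equals the gradient over the whole batch.
    """
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    grads = [partial_gradient(w, s) for s in shards]  # one per simulated GPU
    avg_grad = sum(grads) / num_workers
    return w - lr * avg_grad

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
w = 0.0
for _ in range(200):
    w = sync_sgd_step(w, batch, num_workers=2, lr=0.05)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, which is why adding workers in this scheme means growing the batch rather than changing the update rule.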


“…Previous approach [11] waits for a worker to idle before looking to steal work, incurring additional delays until work is found. Recent effort FlexRR [14] identifies slow workers and reassigns the load before fast workers finished. However, these stealing approaches estimate the stolen task load based linear assumption.…”

Section: Related Work (mentioning)

confidence: 99%

“…Most efforts [11], [12], [13], [14] mitigate skewness based on the assumption that task processing time is linearly dependent on the size of partitioned data. While the assumption may hold true for map stage, it does not work well for reduce stage.…”

Section: Introduction (mentioning)

confidence: 99%

Addressing Skewness in Iterative ML Jobs with Parameter Partition

Wang

Chen

Zhou

et al. 2019

IEEE INFOCOM 2019 - IEEE Conference on Computer Communications


Computational skewness is a significant challenge in multi-tenant data-parallel clusters that introduce dynamic heterogeneity of machine capacity in distributed data processing. Previous efforts to addressing skewness mostly focus on batch jobs based on the assumption that processing time is linearly dependent on the size of partitioned data. However, they are ill-suited for iterative machine learning (ML) jobs, which (1) exhibit a non-linear relationship between the size of partitioned parameters and processing time within each iteration, and (2) show an explicit binding relationship between input data and parameters for parameter update. In this paper, we present FlexPara, a parameter partition approach that leverages the non-linear relationship and provisions adaptive tasks to match the distinct machine capacity so as to address the skewness in iterative ML jobs on data-parallel clusters. FlexPara first predicts task processing time based on a capacity model designed for iterative ML jobs without the linear assumption. It then partitions parameters to parallel tasks through proactive parameter reassignment. Such reassignment can significantly reduce network transmission cost incurred by input data movement due to the binding relationship. We implement FlexPara in Spark and evaluate it with various ML jobs. Experimental results show that compared to hash partition, FlexPara speeds up the execution by up to 54% and 43% in private and NSF Chameleon clusters, respectively.


“…In typical load balancing approaches, there are two basic steps: first to detect stragglers, and second to tackle stragglers by load adjustment. For example, FlexRR [38], a recently proposed load-balancing approach for Machine Learning workloads, detects stragglers by frequently measuring and synchronizing the worker progress at runtime; once stragglers are detected, it immediately offloads some of their training work to other lightly-loaded workers.…”

Section: Load-balancing DL Workloads (mentioning)

confidence: 99%
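The two steps named in the statement above (detect stragglers from measured progress, then offload part of their work to lightly loaded workers) can be sketched roughly as follows. This is a hypothetical illustration, not FlexRR's actual protocol: the function `rebalance`, the `lag_threshold` value, and the rule of moving half of the remaining items are all assumptions.

```python
# Hypothetical sketch of "detect, then offload"; not FlexRR's real algorithm.
# The lag threshold and the halving rule are illustrative assumptions.

def rebalance(remaining, done, lag_threshold=0.2):
    """Move half of each straggler's remaining items to the most advanced worker.

    remaining and done map worker name -> item counts for this iteration
    (each worker is assumed to have at least one item in total). A worker
    counts as a straggler when its completed fraction trails the best
    worker's by more than lag_threshold.
    """
    progress = {w: done[w] / (done[w] + remaining[w]) for w in remaining}
    fastest = max(progress, key=progress.get)
    new_remaining = dict(remaining)
    for w, p in progress.items():
        if w != fastest and progress[fastest] - p > lag_threshold:
            moved = new_remaining[w] // 2
            new_remaining[w] -= moved
            new_remaining[fastest] += moved
    return new_remaining

remaining = {"a": 10, "b": 12, "c": 60}
done = {"a": 90, "b": 88, "c": 40}
balanced = rebalance(remaining, done)
```

Here worker `c` has finished 40% of its items while `a` has finished 90%, so half of `c`'s remaining items move to `a`; worker `b` is within the threshold and is left alone. The total amount of work is unchanged.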

“…That said, existing load balancing approaches [14,17,33,38] are not suitable for deep learning training. They are designed for cases where each iteration is sufficiently long so that stragglers can be detected and tackled, but iterations in deep learning training are quite short in general, lasting for a few seconds or even less than a second.…”

Section: Introduction (mentioning)

confidence: 99%

Fast Distributed Deep Learning via Worker-adaptive Batch Sizing

Chen

Weng

Wang

et al. 2018

Proceedings of the ACM Symposium on Cloud Computing


Deep neural network models are usually trained in cluster environments, where the model parameters are iteratively refined by multiple worker machines in parallel. One key challenge in this regard is the presence of stragglers, which significantly degrades the learning performance. In this paper, we propose to eliminate stragglers by adapting each worker's training load to its processing capability; that is, slower workers receive a smaller batch of data to process. Following this idea, we develop a new synchronization scheme called LB-BSP (Load-balanced BSP). It works by coordinately setting the batch size of each worker so that they can finish batch processing at around the same time. A prerequisite for deciding the workers' batch sizes is to know their processing speeds before each iteration starts. For the best prediction accuracy, we adopt NARX, an extended recurrent neural network that accounts for both the historical speeds and the driving factors such as CPU and memory in prediction. We have implemented LB-BSP for both TensorFlow and MXNet. EC2 experiments against popular benchmarks show that LB-BSP can effectively accelerate the training of deep models, with up to 2× speedup.
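The core idea in this abstract, giving each worker a batch sized to its processing speed so that all workers finish at about the same time, can be sketched as a proportional split. This is a minimal illustration under assumptions: the function `balanced_batch_sizes` and its rounding rule are hypothetical, and the per-worker speeds are taken as given inputs rather than predicted by NARX as in the paper.

```python
# Sketch only: proportional batch sizing under assumed, already-known speeds.
# The paper predicts speeds with NARX; that prediction step is not modelled.

def balanced_batch_sizes(speeds, global_batch):
    """Split global_batch in proportion to worker speeds (samples/second)."""
    total = sum(speeds)
    sizes = [int(global_batch * s / total) for s in speeds]
    # Rounding down can leave a few samples unassigned; hand them out
    # one at a time, fastest workers first.
    leftover = global_batch - sum(sizes)
    by_speed = sorted(range(len(speeds)), key=lambda i: speeds[i], reverse=True)
    for i in by_speed[:leftover]:
        sizes[i] += 1
    return sizes

speeds = [100.0, 50.0, 50.0]           # hypothetical samples/second per worker
sizes = balanced_batch_sizes(speeds, 256)
times = [n / s for n, s in zip(sizes, speeds)]  # per-worker processing time
```

With these inputs the fast worker receives 128 samples and each slow worker 64, so every worker needs 1.28 seconds. An even split of roughly 85 samples each would instead leave the fast worker idle while the slow ones take about 1.7 seconds.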
