Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters

Garraghan, Peter; Ouyang, Xue; Yang, Renyu; McKee, David; Xu, Jie

doi:10.1109/tsc.2016.2611578

Cited by 64 publications

(57 citation statements)

References 16 publications


“…We define straggler task as whose duration is 1.5× larger than median task duration [4,6,8] in the same stage. The idea of our root cause analysis is based on the following two observations: 1) If a feature is abnormal compared to tasks in other nodes, then this feature is highly possible the root cause of the straggler.…”

Section: Root-cause Analysis With Effective Features a Straggle (mentioning)

confidence: 99%

BigRoots: An Effective Approach for Root-Cause Analysis of Stragglers in Big Data System

Zhou, Yang, et al. 2018. IEEE Access


Stragglers are commonly believed to have a great impact on the performance of big data system. However, the reason to cause straggler is complicated. Previous works mostly focus on straggler detection, schedule level optimization and coarse-grained cause analysis. These methods cannot provide valuable insights to help users optimize their programs. In this paper, we propose BigRoots, a general method incorporating both framework and system features for root-cause analysis of stragglers in big data system. BigRoots considers features from big data framework such as shuffle read/write bytes and JVM garbage collection time, as well as system resource utilization such as CPU, I/O and network, which is able to detect both internal and external root causes of stragglers. We verify BigRoots by injecting high resource utilization across different system components and perform case studies to analyze different workloads in Hibench. The experimental results demonstrate that BigRoots is effective to identify the root cause of stragglers and provide useful guidance for performance optimization.
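The straggler definition quoted in the citation statement above (a task whose duration exceeds 1.5× the median task duration in the same stage) can be illustrated with a minimal sketch. The function name, the input layout (a dict of per-stage task durations), and the sample durations are hypothetical and are not taken from BigRoots; reading "1.5× larger than median" as "greater than 1.5 times the median" is also an assumption.

```python
from statistics import median

# Threshold from the quoted definition: 1.5x the stage median.
STRAGGLER_FACTOR = 1.5


def find_stragglers(durations_by_stage, factor=STRAGGLER_FACTOR):
    """Return {stage: [task_id, ...]} for tasks slower than factor x stage median.

    durations_by_stage is a hypothetical layout: {stage: {task_id: seconds}}.
    """
    stragglers = {}
    for stage, tasks in durations_by_stage.items():
        if not tasks:
            continue  # an empty stage has no median
        stage_median = median(tasks.values())
        slow = sorted(t for t, d in tasks.items() if d > factor * stage_median)
        if slow:
            stragglers[stage] = slow
    return stragglers


# Made-up task durations in seconds, for illustration only.
example = {
    "stage-0": {"t0": 10.0, "t1": 11.0, "t2": 9.5, "t3": 18.0},
    "stage-1": {"t4": 4.0, "t5": 4.2, "t6": 4.1},
}
result = find_stragglers(example)
print(result)
```

In this sample, stage-0 has a median of 10.5 s, so the threshold is 15.75 s and only `t3` is flagged; stage-1 has no task above its threshold. The median is computed per stage because the definition compares tasks only within the same stage.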


“…A slow server can delay the onset of next stage computation, and we call it a straggling server. One of the key challenges in cloud computing is the problem of straggling servers, which can significantly increase the job completion time [2]-[4]. Straggler mitigation is a particularly important problem, considering this the organizations such as VMWare and Amazon have spent substantial effort optimizing the operation of virtualization technologies for massive-scale systems [2].…”

Section: Introduction (mentioning)

confidence: 99%

“…One of the key challenges in cloud computing is the problem of straggling servers, which can significantly increase the job completion time [2]-[4]. Straggler mitigation is a particularly important problem, considering this the organizations such as VMWare and Amazon have spent substantial effort optimizing the operation of virtualization technologies for massive-scale systems [2]. This paper aims to find efficient scheduling mechanisms for straggler mitigation by analyzing how the replication of straggling tasks affects the mean service completion time and the mean server utilization cost of computing resources.…”

Section: Introduction (mentioning)

confidence: 99%

Optimal Server Selection for Straggler Mitigation

Badita, Parag, Aggarwal. 2020. IEEE/ACM Trans. Networking


The performance of large-scale distributed compute systems is adversely impacted by stragglers when the execution time of a job is uncertain. To manage stragglers, we consider a multi-fork approach for job scheduling, where additional parallel servers are added at forking instants. In terms of the forking instants and the number of additional servers, we compute the job completion time and the cost of server utilization when the task processing times are assumed to have a shifted exponential distribution. We use this study to provide insights into the scheduling design of the forking instants and the associated number of additional servers to be started. Numerical results demonstrate orders of magnitude improvement in cost in the regime of low completion times as compared to the prior works.
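The trade-off described in this abstract, between job completion time and server utilization cost when extra servers are forked, can be explored with a small Monte Carlo sketch. This is not the paper's analysis. It assumes a single forking instant, that every server independently runs a replica of the whole job with a shifted exponential execution time, that the job completes when the first replica finishes, and that all servers are released at that moment. The function names and all parameter values are hypothetical.

```python
import random


def simulate_single_fork(n0, n1, fork_time, shift, rate, rng):
    """Simulate one job: n0 servers start at time 0, n1 more at fork_time.

    Each replica takes shift + Exp(rate) time units (shifted exponential).
    Returns (completion_time, total_server_time_used).
    """
    starts = [0.0] * n0 + [fork_time] * n1
    finishes = [s + shift + rng.expovariate(rate) for s in starts]
    done = min(finishes)  # the first replica to finish completes the job
    # Servers scheduled to start after completion are never used.
    cost = sum(max(0.0, done - s) for s in starts)
    return done, cost


def estimate(n0, n1, fork_time, shift=1.0, rate=1.0, runs=20000, seed=0):
    """Average completion time and server-time cost over many simulated jobs."""
    rng = random.Random(seed)
    total_time = 0.0
    total_cost = 0.0
    for _ in range(runs):
        t, c = simulate_single_fork(n0, n1, fork_time, shift, rate, rng)
        total_time += t
        total_cost += c
    return total_time / runs, total_cost / runs


# Compare no forking against forking two extra servers at time 0.5.
single_time, single_cost = estimate(n0=1, n1=0, fork_time=0.0)
forked_time, forked_cost = estimate(n0=1, n1=2, fork_time=0.5)
print(f"no fork: time={single_time:.3f}, cost={single_cost:.3f}")
print(f"fork:    time={forked_time:.3f}, cost={forked_cost:.3f}")
```

With a single server and no forking, the mean completion time should approach `shift + 1/rate`, which gives a quick sanity check on the simulator. Varying `fork_time` and `n1` shows how earlier or larger forks shorten completion time under these assumptions while changing the server time consumed.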


“…Re-execution is made easy by the use of distributed file systems based on replication such as HDFS [10]. The impact of stragglers and more generally of tasks that last longer than expected has been recently analyzed in [11] and our goal is to build the counterpart of this study in the context of HPC platforms. Indeed, this phenomena has been the object of very few studies in the context of dynamic schedulers on HPC platforms.…”

Section: Introduction (mentioning)

confidence: 99%

Influence of Tasks Duration Variability on Task-Based Runtime Schedulers

Beaumont, Eyraud-Dubois, Gao. 2019. 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)


In the context of HPC platforms, individual nodes nowadays consist in heterogenous processing resources such as GPU units and multicores. Those resources share communication and storage resources, inducing complex co-scheduling effects, and making it hard to predict the exact duration of a task or of a communication. To cope with these issues, runtime dynamic schedulers such as StarPU have been developed. These systems base their decisions at runtime on the state of the platform and possibly on static priorities of tasks computed offline. In this paper, our goal is to quantify performance variability in the context of HPC heterogeneous nodes, by focusing on very regular dense linear algebra kernels. Then, we analyze the impact of this variability on a dynamic runtime scheduler such as StarPU, in order to analyze whether the strategies that have been designed in the context of MapReduce applications to cope with stragglers could be transferred to HPC systems, or if the dynamic nature of runtime schedulers is enough to cope with actual performance variations.
