2019
DOI: 10.1109/tsc.2016.2611578

Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters

Abstract: Increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena substantially affecting overall system performance. One such phenomenon is known as the "Long Tail", whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to …

Cited by 64 publications (57 citation statements)
References 16 publications
“…We define straggler task as whose duration is 1.5× larger than median task duration [4,6,8] in the same stage. The idea of our root cause analysis is based on the following two observations: 1) If a feature is abnormal compared to tasks in other nodes, then this feature is highly possible the root cause of the straggler.…”
Section: Root-cause Analysis With Effective Features a Straggle (mentioning)
confidence: 99%
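
The straggler definition quoted above (a task whose duration exceeds 1.5× the median task duration in the same stage) can be illustrated with a minimal sketch. This is not code from the citing paper: the `(task_id, stage_id, duration)` record layout, the function name `find_stragglers`, and the sample durations are all assumptions made for illustration, and "1.5× larger than" is read here as "strictly greater than 1.5 times the median".

```python
from collections import defaultdict
from statistics import median

# Threshold taken from the quoted definition (1.5x the stage median).
STRAGGLER_FACTOR = 1.5


def find_stragglers(tasks, factor=STRAGGLER_FACTOR):
    """Return IDs of tasks whose duration exceeds factor * stage median.

    `tasks` is an iterable of (task_id, stage_id, duration) tuples.
    This record layout is a hypothetical one chosen for the example.
    """
    # Group task durations by the stage they belong to.
    by_stage = defaultdict(list)
    for task_id, stage_id, duration in tasks:
        by_stage[stage_id].append((task_id, duration))

    stragglers = []
    for entries in by_stage.values():
        # The median is computed per stage, as in the quoted definition.
        stage_median = median(d for _, d in entries)
        for task_id, duration in entries:
            if duration > factor * stage_median:
                stragglers.append(task_id)
    return stragglers


# Hypothetical durations in seconds, invented for this example.
tasks = [
    ("t1", "s0", 10.0), ("t2", "s0", 11.0),
    ("t3", "s0", 12.0), ("t4", "s0", 30.0),
    ("t5", "s1", 5.0), ("t6", "s1", 6.0), ("t7", "s1", 7.0),
]
print(find_stragglers(tasks))  # ['t4']
```

In stage `s0` the median is 11.5, so the cutoff is 17.25 and only `t4` (30.0) is flagged; in stage `s1` the cutoff is 9.0 and no task exceeds it. Grouping by stage matters because a duration that is normal in one stage may be an outlier in another.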
“…A slow server can delay the onset of next stage computation, and we call it a straggling server. One of the key challenges in cloud computing is the problem of straggling servers, which can significantly increase the job completion time [2]- [4]. Straggler mitigation is a particularly important problem, considering this the organizations such as VMWare and Amazon have spent substantial effort optimizing the operation of virtualization technologies for massive-scale systems [2].…”
Section: Introduction (mentioning)
confidence: 99%
“…One of the key challenges in cloud computing is the problem of straggling servers, which can significantly increase the job completion time [2]- [4]. Straggler mitigation is a particularly important problem, considering this the organizations such as VMWare and Amazon have spent substantial effort optimizing the operation of virtualization technologies for massive-scale systems [2]. This paper aims to find efficient scheduling mechanisms for straggler mitigation by analyzing how the replication of straggling tasks affects the mean service completion time and the mean server utilization cost of computing resources.…”
Section: Introduction (mentioning)
confidence: 99%
“…Re-execution is made easy by the use of distributed file systems based on replication such as HDFS [10]. The impact of stragglers and more generally of tasks that last longer than expected has been recently analyzed in [11] and our goal is to build the counterpart of this study in the context of HPC platforms. Indeed, this phenomena has been the object of very few studies in the context of dynamic schedulers on HPC platforms.…”
Section: Introduction (mentioning)
confidence: 99%