2019
DOI: 10.1002/dac.4127

Dependability analysis for characterizing Google cluster reliability

Abstract: Cloud solutions are emerging as a new suitable way of transforming traditional IT data centers to highly available and reliable computing resources for hosting critical applications and data. However, software and hardware failures are a common problem in cloud datacenters that can lead to harmful damages. In this paper, we analyze the physical server failures in the Google cloud datacenter. We study the Google cluster properties to investigate the relationship among physical servers' failure rate and jobs fai…

Citations: cited by 8 publications (5 citation statements)
References: 39 publications (59 reference statements)
“…Mesbahi MR et al [12,13] finds a medium positive correlation between job failure and machine failure rate and a weak positive correlation between job failure and machine updates in Google clouds. However, this paper only considered the causes of job failure in terms of physical failures and did not consider the influence of their running factors.…”
Section: Correlation Analysis Of Failure Factors
Citation type: mentioning
confidence: 99%
“…Mesbahi MR et al [12,13] finds a medium positive correlation between job failure and machine failure rate and a weak positive correlation between job failure and machine updates in Google clouds. However, this paper only considered the causes of job failure in terms of physical failures and did not consider the influence of their running factors.…”
Section: Correlation Analysis Of Failure Factors
Citation type: mentioning
confidence: 99%
“…On the one hand, in [4] [5], They research the statistical characteristics between failed jobs and physical machines in Google Cloud Services. They then analyze and find that job failure has a moderate positive correlation with machine failure and only a weak positive correlation with machine updates.…”
Section: Violation and Failure Factors
Citation type: mentioning
confidence: 99%
“…The analyzed references considering Markov modelling for assessing the availability and cybersecurity of cloud and IoT systems can be divided into three groups: the references related to assessing availability and reliability [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27], the references related to assessing cybersecurity [28][29][30][31][32][33][34][35][36][37][38][39], and references related to assessing both availability and cybersecurity as attributes of dependability of cloud and IoT systems [40][41][42][43].…”
Section: State Of the Art
Citation type: mentioning
confidence: 99%
“…The approach allowed estimating the probabilities of cloud threats and types of attacks, which were confirmed by the simulation results. To explore the relationship between physical servers' failure rates and job failure events, a CTMC reliability model for Google cluster physical machines was presented in [30]. The reliability model was utilized for evaluating steady-state availability, steady-state unavailability, mean time to failure, and mean time to repair in the Google cluster.…”
Section: State Of the Art
Citation type: mentioning
confidence: 99%
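
The last citation statement names four quantities evaluated with a CTMC reliability model: steady-state availability, steady-state unavailability, mean time to failure, and mean time to repair. As a rough illustration of how these relate, here is a minimal sketch for the simplest possible CTMC, a single machine with two states (up and down). This is an assumption made for illustration only and is not the model from the paper, which may use more states. The function name, the class name, and the rates used are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoStateMetrics:
    """Dependability measures of a two-state (up/down) CTMC."""
    availability: float    # long-run fraction of time in the up state
    unavailability: float  # long-run fraction of time in the down state
    mttf: float            # mean time to failure
    mttr: float            # mean time to repair


def two_state_ctmc(failure_rate: float, repair_rate: float) -> TwoStateMetrics:
    """Closed-form steady-state measures for an up/down Markov chain.

    failure_rate is the up -> down transition rate (lambda) and
    repair_rate is the down -> up transition rate (mu), both per unit time.
    Both values are illustrative inputs, not figures from the paper.
    """
    if failure_rate <= 0 or repair_rate <= 0:
        raise ValueError("both rates must be positive")
    total = failure_rate + repair_rate
    return TwoStateMetrics(
        availability=repair_rate / total,     # mu / (lambda + mu)
        unavailability=failure_rate / total,  # lambda / (lambda + mu)
        mttf=1.0 / failure_rate,              # exponential sojourn in "up"
        mttr=1.0 / repair_rate,               # exponential sojourn in "down"
    )


if __name__ == "__main__":
    # Hypothetical rates: one failure and three repairs per unit time.
    print(two_state_ctmc(failure_rate=1.0, repair_rate=3.0))
```

In this two-state case the availability also equals MTTF / (MTTF + MTTR), which is why the four measures are usually reported together.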