2009
DOI: 10.1016/j.jpdc.2009.03.007

An analysis of clustered failures on large supercomputing systems

citations
Cited by 55 publications
(63 citation statements)
references
References 5 publications
“…In order to influence risk assessment along any business/third party grounds the management and the setting of the risk assessment is taken outside of the cloud fabric and turned it into a service level [4]. In terms of wider resource failure a wide range of studies exist for distributed computing environments [32], [33]. Risk from a third party service as an extension to risk assessment mechanisms has also been explored in cloud environments [34].…”
Section: Related Work (mentioning)
confidence: 99%
“…The key parameter is the MTBF µ = 1/λ. Weibull distributions are a good example of probability distributions that account for infant mortality, and they are widely used to model failures on computer platforms [42,67,54,39,43]. The definition of Weibull(k, λ), the Weibull distribution law of shape parameter k and scale parameter λ, goes as follows:…”
Section: Resilience At Scale (mentioning)
confidence: 99%
“…But if k < 1, the failure rate decreases with time, and the smaller k, the more important the decreasing. Values used in the literature are k = 0.7 or k = 0.5 [39,54,67].…”
Section: Resilience At Scale (mentioning)
confidence: 99%
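
The effect of the shape parameter described in the statement above can be checked numerically. The following is a minimal sketch, not taken from the citing paper: it assumes the standard Weibull parameterization with shape k and scale λ, whose failure (hazard) rate is h(t) = (k/λ)(t/λ)^(k-1) and whose mean is λΓ(1 + 1/k). The function names `weibull_hazard` and `weibull_mean` are hypothetical, chosen only for illustration.

```python
import math


def weibull_hazard(t, k, lam):
    """Instantaneous failure rate h(t) = (k/lam) * (t/lam)**(k-1).

    Assumes the standard shape/scale parameterization (an assumption,
    since the quoted definition is truncated in the excerpt).
    """
    if t <= 0 or k <= 0 or lam <= 0:
        raise ValueError("t, k and lam must all be positive")
    return (k / lam) * (t / lam) ** (k - 1)


def weibull_mean(k, lam):
    """Mean time between failures under the same parameterization."""
    return lam * math.gamma(1 + 1 / k)


if __name__ == "__main__":
    times = (0.5, 1.0, 2.0, 4.0)
    # k = 0.5 and k = 0.7 are the values the excerpt reports from the
    # literature; k = 1.0 is the exponential case with a constant rate.
    for k in (0.5, 0.7, 1.0):
        rates = [weibull_hazard(t, k, 1.0) for t in times]
        print(f"k={k}: ", [round(r, 3) for r in rates])
```

With k < 1 the printed rates fall as t grows, and they fall faster for smaller k; with k = 1 the rate stays at 1/λ, which matches the exponential case where the MTBF is the inverse of the rate.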
“…A number of studies have looked at resource failures in distributed environments [28,8,29,30,31,32,33,34,35]. Schroeder and Gibson [28] analyse failure data collected over 9 years at Los Alamos National Laboratory (LANL), and includes 23,000 failures recorded on more than 20 different systems, mostly large clusters of Symmetric-Multi-Processing (SMP) and Non-Uniform-Memory-Access (NUMA) nodes.…”
Section: Related Work (mentioning)
confidence: 99%