2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2017
DOI: 10.1109/dsn.2017.26
|View full text |Cite
|
Sign up to set email alerts
|

What Can We Learn from Four Years of Data Center Hardware Failures?

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
1
1
1

Citation Types

0
43
0

Year Published

2017
2017
2023
2023

Publication Types

Select...
3
3
2

Relationship

0
8

Authors

Journals

citations
Cited by 94 publications
(43 citation statements)
references
References 22 publications
0
43
0
Order By: Relevance
“…Several studies describe and analyze failures in large scale computing systems using data coming from production systems [14], [15], [17], [25], [27], [29]. Other work focus on the failures related to specific hardware components (e.g., hard drives [21], [24], GPUs [20], or memory [12], [28]).…”
Section: A Analysis Of Failures In Large Scale Systemsmentioning
confidence: 99%
See 1 more Smart Citation
“…Several studies describe and analyze failures in large scale computing systems using data coming from production systems [14], [15], [17], [25], [27], [29]. Other work focus on the failures related to specific hardware components (e.g., hard drives [21], [24], GPUs [20], or memory [12], [28]).…”
Section: A Analysis Of Failures In Large Scale Systemsmentioning
confidence: 99%
“…Abnormal events in supercomputers that lead to crash failures [15], [17], [25], [29] or corrupted states [12], [28] are widely studied in the literature. On the other hand, a CPU overheating is an event that limits the performance of the system without directly compromising the results of the executions.…”
Section: Introductionmentioning
confidence: 99%
“…B. Schroeder et al [3], provided a quantitative analysis of hard disk replacement rates and discussed the statistical properties of the distribution capturing time between replacements. On the other hand, Wang et al [4], concludes that the time between failure cannot be captured through any of the standard distributions and commented on the difficulty of capturing fault trend using standard distributions.…”
Section: B Related Workmentioning
confidence: 99%
“…is the standard score of x, yi−ȳ sy is the standard score of y, x and y are two datasets containing n values of a pair of resource use counters or a pair of events, 2 is the sample standard deviation of y,x andȳ is the sample mean of x and y.…”
Section: Correlationmentioning
confidence: 99%
“…This research is summarized briefly below and analyzed in Section V. HPC and data center operators face challenges on a day by day basis. Recent work that studied HPC and data center failures [1], [2] have shown that the mean time between failures is decreasing and it is important for the system administrators to identify errors quickly. However, these high-performance systems generate a massive amount of data.…”
Section: Introductionmentioning
confidence: 99%