2018
DOI: 10.1145/3242086

Fail-Slow at Scale

Abstract: Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types such as disk, SSD, CPU, memory, and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we ma…

Cited by 66 publications (1 citation statement)
References 27 publications
“…Although SSDs are widely used in various fields, errors including hardware and firmware are still being reported including enterprise area [29][30][31][32][33]. The occurrence of such errors is mostly due to the characteristics of the flash memory introduced below.…”
Section: Reliability Of Flash Memory (mentioning)
confidence: 99%