2012
DOI: 10.1145/2366316.2366332

Weathering the unexpected

Abstract: Failures happen, and resilience drills help organizations prepare for them.

Cited by 5 publications (4 citation statements)
References: 0 publications
“…We use regression testing before rolling out software updates and deploy canaries at smaller scales before deploying to the entire network. We also periodically exercise disaster scenarios [20,17] and enhance our systems based on lessons from these exercises. We carefully document every management operation (MOp) on the network.…”
Section: Baseline Availability Mechanisms (mentioning)
confidence: 99%
“…Different from public WANs, backbone link capacity at Facebook is not a constraint (similar to other private backbone networks [18-20, 22, 24, 29, 44]), especially given that the traffic induced by dynamic content is only a small percentage of the overall bandwidth. Note that backbone capacity is constantly verified by regular load testing [53] and drain/DiRT-like testing [30,54] that manipulate live traffic at edge nodes to simulate how data centers handle worst-case scenarios. Compared with the amount of traffic driven by these tests, the amount of dynamic user traffic Taiji manages does not stress our backbone links.…”
Section: Introduction (mentioning)
confidence: 99%
“…Tolerance to any kind of service disruption, whether caused by a simple hardware fault or by a large-scale disaster, is key for the survival of modern distributed systems. Cloud-scale applications must be inherently resilient, as any outage has direct implications on the business behind them [24].…”
Section: Introduction (mentioning)
confidence: 99%